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Summary  of  Results 

This  report  describes  the  major  accomplishments  under  the  grant  AFOSR- 
89-0032.  The  main  focus  of  research  was  on  two  major  areas:  study  of  parallel 
architectures,  and  design  of  efficient  parallel  algorithms  for  image  understand¬ 
ing.  In  the  following,  we  summarize  the  research.  The  appendix  includes  copies 
of  publications  supported  under  this  grant. 


1  Memory  Systems  for  Image  Processing 

» 

Communication  between  memory  and  processors  always  has  been  a  bottleneck 
in  high  performance  computer  systems.  Low  memory  bandwidths  can  lead  to  a 
significant  performance  degradation  of  these  systems.  Particularly,  in  real  time 
image  processing  and  vision  applications,  where  the  amount  of  data  movement 
is  enormous,  very  fast  memory  operations  are  needed.  A  new  efficient  memory 
system  has  been  proposed  for  image  processing  and  vision.  In  this  system,  Latin 
Squares  are  adopted  as  the  skewing  scheme.  A  new  type  of  Latin  squares,  per¬ 
fect  Latin  squares  with  special  properties  are  introduced  for  imag^' processing. 
A  simple  construction  method  for  perfect  Latin  squares  is  also  shown.’ The  re¬ 
sulting  memory  system  provides  access  to  various  subsets  of  image  data  (rows, 
columns,  diagonals,  main  subsquares  etc.)  without  memory  conflict.  The  mem¬ 
ory  modules  are  fully  utilized  for  most  frequently  used  subsets  of  image  data. 
The  address  generation  can  be  performed  in  constant  time.  This  memory  system 
is  the  first  known  system  that  achieves  constant  time  access  to  rows,  columns, 
diagonals  and  subarrays  using  minimum  number  of  memory  modules  [5]. 


2  Electro-optical  Architectures 

Arrays  of  processors  with  optical  interconnection  networks  for  fine  grain  image 
processing  are  studied.  It  has  been  shown  that  the  unit  time  interconnection 
medium  made  possible  with  the  use  of  free  space  optics  results  in  efficient  so¬ 
lutions  to  communication  intensive  image  computations.  Using  these  models, 
O(logW)  pointer  based  algorithms  for  several  tasks  in  low  and  medium  level 
vision  are  presented.  We  also  presented  a  generic  solution  that  can  be  used 
to  simulate  PRAM  algorithm  for  image  computation  on  the  proposed  electro- 
optical  arrays  [4]. 
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3  Orthogonal-access  Parallel  Organizations 

We  have  investigated  a  class  of  orthogonal-access  parallel  organizations  for  ap¬ 
plications  in  image  and  vision  analysis.  These  architectures  consist  of  a  mas¬ 
sive  memory  and  reduced  number  of  processors  which  access  the  shared  mem¬ 
ory.  The  memory  can  be  envisaged  as  an  array  of  memory  modules  in  the 
k-dimensional  space,  with  each  row  of  modules -along  a  certain  dimension  con¬ 
nected  to  one  bus.  Each  processor  has  access  to  one  bus  along  each  dimension. 
It  is  shown  that  these  organizations  are  communication-efficient  and  can  provide 
processor-time  optimal  solutions  to  a  wide  class  of  image  and  vision  problems. 
In  the  two  dimensional  case,  the  basic  organization  has  n  processors  and  n  x  n 
memory  arrv'  which  can  hold  n  x  n  image,  and  it  provides  O(n)  time'solution  to 
several  imaj  computations  including:  histogram,  histogram  equalization,  com¬ 
puting  connected  components,  convexity  problems,  and  pomputing  distances, 
etc.  [6]. 
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Abstract 

A  novel  memory  system  is  proposed  for  image  process¬ 
ing.  Latin  squares,  which  are  well  known  combinatorial 
objects  for  centuries,  are  used  as  the  skew  function  of  the 
memory  system.  We  introduce  a  new  Latin  square  with 
desired  properties  for  image  array  access.  The  resulting 
memory  system  provides  access  to  various  subsets  of  im¬ 
age  data(rows,  columns,  diagonals,  main  subsquarcs  etc.) 
without  memory  conflict.  The  memory  modules  are  fully 
utilized  for  most  frequently  used  subsets  of  image  data. 
The  address  generation  can  be  performed  in  constant  time. 
This  memory  system  is  the  first  known  memory  system  that 
achieves  constant  time  access  to  rows,  columns,  diagonals 
and  subarrays  using  minimum  number  of  memory  modules. 

1  Introduction 

With  recent  advances  in  VLSI  and  parallel  processing,  large 
number  of  processors  can  be  interconnected  to  yield  high 
performance  systems  for  image  processing  and  vision.  Most 
image  processing  tasks  not  only  perform  intensive  compu¬ 
tations  but  also  process  enormous  amount  of  data.  Low 
effective  bandwidth  of  the  memory  system  can  lead  to  sig¬ 
nificant  performance  degradation.  In  order  to  fully  utilize 
the  processing  power  available  iti  parallel  systems,  special 
attention  should  be  given  to  memory  organizations  as  well 
as  mapping  of  image  arrays. 

In  image  processing,  an  image  can  be  represented  by  a 
two  dimensional  array  of  size  M  x  .-V.  It  is  well  understood 
that  access  to  row  vectors,  column  vectors,  diagonals  and 
subarrays  arc  required  heavily  in  image  processing(10,17). 
A  memory  system  for  image  piocessing  should  provide  ef¬ 
ficient  access  to  rows,  columns,  diagonals  and  subarrays. 
Also,  while  implementing  partitioned  VLSI  arrays  and  spe¬ 
cial  purpose  arrays,  access  to  various  suhatrays  is  needed. 

Today’s  computers  use  multiple  memory  modules  either 
in  interleaved  memory  form  or  in  paiallel  memory  form. 

'This  research  was  supported  in  |>uit  In  the  Vatn.ii.il  '-cience  Keiin- 
dation  under  grant  IItl-$7IOS9l>  and  in  pail  l.y  A  l‘( |;  under  grant 
AFOSR-89-OO.T2. 


The  effective  bandwidth  of  these  memory  systems  not  only 
depends  on  the  number  of  modules  and  the  speed  of  oper¬ 
ation  but  also  depends  on  the  occurrence  of  memory  con¬ 
flicts,  which  occur  when  more  than  one  memory  request  is 
given  to  the  same  memory  module  during  a  memory  cy¬ 
cle.  Hence,  the  memory  system  should  provide  conflict  free 
access  to  as  many  subsets  ef  interest  as  possible. 

Many  researchers  jia^e  addressed  the  issue  of  design¬ 
ing  memory  systems  which  provide  efficient  access  to  rows, 
columns,  diagonals  or  subarrays|l,2,9,ll,12,lG,17,18,l9]. 
Budnik  and  Kuck  introduced  linear  skewing  scheme  and 
prime  memory  system  which  allows  conflict  free  access  to 
rows,  columns,  diagonals  and  A,1/!  x  Nl,i  subarrays  of  an 
JVxJV  array,  in  which  the  number  of  memory  modules  is  a 
prime  number  greater ’than  W[2,9).  However,  their  method 
needs  modulo  operation  o(,a  prime  number  in  address  gen¬ 
eration,  which  can  not  be  performed  in  constant  time  with 
a  reasonable  amount  of  c^r.cuitry.  Lawrie(ll)  generalized 
the  linear  skewing  schcme  'attd  proposed  an  array  storage 
scheme  with  a  simple  address  generation.  His  scheme  pro¬ 
vides  conflict  free  access  to  rows,  columns,  diagonals  and 
A"/3  x  N1*7  subarrays  of  an  N  x  N  array  using  2Ar  memory 
modules.  He  also  introduced  the  omega  network  to  perform 
data  routing  and  data  alignment.  However,  his  scheme  suf¬ 
fers  from  low  utilization  of  memory  modules.  Only  half  of 
the  memory  modules  arc  used  in  a  memory  fetch.  Mem¬ 
ory  organizations  with  linear  skewing  schemes  suffer  either 
low  memory  utilization  or  difficulty  in  address  generation. 
Another  way  of  achieving  efficient  access  to  various  subsets 
of  interest  is  by  using  nonlinear  skewing  schetncs[l,3,12,16]. 
Lcc[12]  presented  a  nonlinear  skewing  schcme(callcd  scram¬ 
bled  storage  scheme)  and  a  new  network  which  allows  con¬ 
flict  free  access  to  rows,  columns,  square  blocks  and  dis¬ 
tributed  blocks.  The  scrambled  storage  scheme  docs  not 
need  modulo  operations  for  address  generation.  However, 
it  can  not  provide  efficient  access  to  diagonals. 

l'.ven  though  a  lot  of  woik  has  been  done  in  the  con¬ 
text  of  array  storage  schemes  for  general  purpose  parallel 
computers,  not  much  has  been  done  with  respect  to  access 
patterns  needed  for  image  processing  and  vision.  Voorhis 
and  Morriu(l7j  pioposed  a  linear  skewing  scheme  and  an 
address  generation  circuitry  for  image  processing  in  which 
only  one  memory  module  is  not  used  dining  a  memory 
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I'-leli.  I  lirii  noIii-iiii*  |>ii. viili^  n.iillii'l  fire  access  to  urn's, 
columns  i  *  ,il>;iiTiiy>.  However.  tlirir  scheme  requites 
inoiliiln  opeiution  with  a  initiilirr-  yrhirh  is  not  a  power  of 
two,  l.tiiri.  tlirii  iiirtlioil  wsi>  improved  liv  Park;  14;  using 
a  simple!  aihli ess  e.eliel aiion  ciietdlry. 

To  achieve  eonlliet  flee  tieeess  to  various  Milisets  of  data, 
we  piopose  to  use  Latin  squares  which  are  well  known  rum 
liinatoiial  objects  for  cent  nriesj-lj.  We  introduce  a  new 
Latin  .sipiaiefdenoted  |»et feet  Latin  square)  which  has  piop- 
ertics  useful  for  image  pioccssing  and  vision.  We  also  show 
a  simple  construction  method  for  perfect  Latin  squat es.  Us¬ 
ing  perfect  Ltitin  sipiare  as  the  skewing  scheme,  various  sub¬ 
sets  of  image  data  are  accessible  without  memory  eonlliet 
while  memory  modules  are  fully  utilized  for  most  interest¬ 
ing  subsets  of  image  data.  Furthermore,  it  is  possible  to 
show  that  the  address  generation  can  be  done  in  constant 
time. 

In  the  next  section,  we  describe  memory  access  require¬ 
ments  for  image  processing.  In  section  3,  we  introduce  per¬ 
fect  Latin  squares.  We  also  show  a  construction  method  for 
perfect  Latin  squares.  In  section  4,  we  describe  how  perfect 
Latin  squares  can  be  used  for  storing  and  retrieving  image 
data.  Section  fi  concludes  the  paper. 

2  Preliminaries 

A  memory  module  can  be  characterized  by  the  amount  of 
data  it  can  stcre  and  the  cycle  time  which  is  the  mini¬ 
mum  time  between  two  consecutive  read  or  write  oper¬ 
ations.  Clearly,  the  cycle  time  determines  bandwidth  of 
the  memory  system.  To  overcome  the  limitation  imposed 
by  cycle  time,  multiple  memory  modules  are  used  in  to¬ 
day’s  computers.  One  way  of  using  multiple  memory  mod¬ 
ules  is  by  organizing  them  as  parallel  memory  banks.  In 
this  memory  system,  multiple  memory  banks  arc  accessed 
at  the  same  time,  therefore,  the  effective  memory  band¬ 
width  is  multiplied  by  t he  number  of  memory  banks.  This 
scheme  is  commonly  used  in  SIMD  or  MIMD  computers. 
Another  way  of  using  multiple  memory  modules  is  by  orga¬ 
nizing  them  as  interleaved  memories  which  arc  commonly 
employed  in  pipelined  vector  computers.  In  this  memory 
system,  memory  access  mechanism  is  divided  into  several 
stages  and  used  in  a  pipelined  manner.  Cray  X-MP  uses 
a  3'2-way  interleaved  memory  system.  Some  memory  sys¬ 
tems  employ  both  parallel  memory  banks  and  interleaved 
memories.  In  such  a  system,  each  parallel  memory  hank  is 
composed  of  interleaved  memories. 

liven  if  multiple  memory  modules  arc  used  to  increase 
(he  effective  bandwidth  of  the  memory  system,  memory 
eonlliet  can  greatly  reduce  the  effective  bandwidth.  For 
an  extreme  example,  if  all  data  elements  to  he  accessed  sue 
Moled  in  the  same  memory  module,  the  effective  bandwidth 
is  same  as  using  only  one  memory  module  irrespective  of 


tin  iiimibe;-  <■!  memory  modules  ill  the  memory  system. 

Itudmk  and  Kink]'.!!  showed  that,  in  general'  memory 
< outle  ts  c.m  not  lie  avoided  for  arbitrary  data  access  using 
a  limited  numliei  of  memory  schemes.  However,  if  a  limited 
liumhei  ol  subsets  of  data  me  Used  ftequeiilly.  ail  ellicient 
memory  scheme  van  lie  provided  for  access  to  the  subsets 
of  iuteiest  lesultitig  in  high  effective  bandwidth. 

(n  image  processing,  row  vectors,  column  vectors,  sub- 
at  rays  and  diagonalsfor  diagonals  of  subarrays)  are  heavily 
Used.  The  importance  of  row  access  is  obvious  considering 
that  image  inputs  and  outputs  arc  usually  in  tow  major 
ordcrflcft  to  right,  top  to  bottom).  There  are  also  many 
algorithms  which  need  row  accesses  like  one  dimensional 
convolution  algorithms.  Desirability  for  column  access  is 
justified  by  the  commonly  used  90  degree  rotation  of  image 
and  by  algorithms  which  need  column  accesses  like  the  two 
dimensional  FFT  algorithms.  Subarrays  play  an  impor¬ 
tant  role  in  two  dimensional  template  matching  and  other 
algorithms.  Partitioning  is  an^  essential  step  in  mapping 
algorithms  to  fixed  size  VLSI  arrays|l3).  Access  to  sub¬ 
arrays  of  image  data  is  needed  in  such  implementations. 
Diagonals  or  diagonals  of  subarrays  arc  also  useful  in  ma¬ 
trix  computation  which  is  an  indispensable  tool  in  image 
processingjlOj. 

Prom  the  above  arguments,  it  is  obvious  that  a  memory 
system  for  image  processing  should  provide  efficient,  hope¬ 
fully  conflict  free,  access  to  row  vectors,  column  vectors, 
subarrays,  and  diagonals  for  ijnage  processing. 

In  this  paper,  we  restrict  our  attention  to  the  cases 
where  the  number  of  memory,  modules  is  an  even  power 
of  two.  The  desirability  of  hading  pu tuber  of  memory  mod¬ 
ules  to  be  an  even  power  of  two  is  clear  in  terms  of  address 
generation,  utilization  of  address  space  and  interconnection 
network. 

3  Latin  Squares  for  Memory  Ac¬ 
cess 

A  l.nlin  square  of  order  n  is  an  n  x  n  square  composed 
wit  It  symbols  from  0  to  n  -  1  such  that  uo  symbol  appears 
more  than  oucc  in  a  row  or  in  a  columnjd).  The  rows  are 
numbered  from  0  to  n  —  1,  top  to  bottom.  The  columns  are 
also  numbered  from  0  to  n  —  1,  left  to  right.  The  square  ,4 
shown  below  is  an  example  of  Latin  square  of  order  4. 

0  12  3' 

12  3  0 
A  "  2  3  0  1 

3  0  1  2 

A  iluKjonal  l.nlin  square  of  order  n  is  a  Latin  square  such 
that  no  symbol  appears  more  than  once  in  any  of  its  two 
main  diagonals.  A  simple  construction  m-tliod  is  known 
lor  diagonal  Latin  squares  of  order  n  when  it  is  a  power 
of  two'  l  .  (lergelv  showed  a  general  method  to  .'oust met 
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diagonal  l.:ii in  squates  of  -my  oidci  «i  *«  Tin'  square  It 
shown  below  is  ;iu  example  if  •  lingon,.!  Lalm  square  of  order 

f  »  I  2  ■»  1 


[  I  0  :<  j 

Two  Latin  squares  (’  and  I)  of  oidet  ii  .ire  orthogonal  to 
each  other  if  the  set  of  ordered  |);iirs 
CD  —  0  i.j  n  1}  is  equal  to  the  set 

of  all  possible  ordered  pairs  (,»,_/ ),  (I  v  i,  j  <  n  -  1.  The 
squares  C  and  D  shown  below  are  orthogonal  to  each  or- 
liter. 


0  1  2  ' 

'0  1  2 

= 

1  2  0 

1)  - 

2  0  1 

.20  1  . 

.1  2  0  . 

A  self-orthogonal  l.atin  square  is  a  Latin  square  orthogonal 
to  its  transpose.  We  define  a  doubly  self-orthogonal  Latin 
square  as  a  Latin  square  orthogonal  to  its  transpose  and 
to  its  antitranspose.  Notice  that  a  doubly  self-orthogonal 
Latin  square  is  also  a  diagonal  Latin  square.  The  square  B 
shown  above  is  also  an  example  of  doubly  self-orthogonal 
Latin  square.  We  define  a  subsguarc  S,.t  of  a  Latin  square 
of  order  n5  as  an  n  x  n  square  whose  top  left  cell  has  the 
coordinate  {i,j ).  When  i  -. a  0  mod  n  and  j  5  0  mod  «,  the 
subsquare  S-,j  is  called  a  main.subsquarc.  Subsquare  $o.i 


of  the  square  B  is 


1  2 
3  0 


and  main  suhsquare  5j.o  of  the 


square  B  is 


3  2 
1  0 


We  define  a  perfect  Latin  square  of 


order  nJ  as  a  diagonal  Latin  square  of  order  tr  such  that 
no  symbol  appears  more  than  once  in  any  main  subsquare. 
Hence,  in  a  perfect  Latin  square,  no  symbol  appears  more 
than  once  in  any  row,  in  any  column,  in  any  diagonal  or 
in  any  main  subsquare.  The  square  E  shown  below  is  a 
perfect  Latin  square  of  order  9. 


'  0 
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1 

•1 
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•> 
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3 
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In  the  rest  of  this  section,  we  present  a  construction 
method  for  perfect  Latin  squares  of  order  ;r  for  the  case 
n  =  21,  k  (■  {1.2,3...}.  The  interested  leader  can  refer 
to  (8)  for  the  construction  of  perfect  Latin  squares  of  order 
n2  where  n  is  odd  or  n  ~  '2'tii2,  I  i  {2, 3.  I _ },in  is  odd. 

For  the  construction  of  perfect  Latin  squares,  we  start 
with  the  following  lemma. 


Lcmiiiii  3.1  If  thru  in  .-I  tiro  doutdg  .»<  If ■  orthogonal  I, ah  n 
sgium  s  .-1  of  ordtrm  and  II  of  oidir  ii.  then  then  <  lists  a 
doubly  n  If-orthoyonnl  Latin  m/ihim  (  '  of  ordir  inn. 

\  '  « 

Proof:  Fioiii  the  doubly  self-oithogonal  Latin  squates  .-1 
and  B,  we  const  met  a  squaie  C  of  order  inn  with  the  fol¬ 
lowing  rule: 

*  i.j  "  **  "  !  ^M.u‘il.1,1  im< >.i -i ,  II  S  i,  j  mu  •-  1 

where  |i/nj  represents  the  largest  integer  not  exceeding  i/n 
and  i  mod  n  represents  the  nonnegative  remainder  of  i/n. 
The  construction  of  C  can  be  viewed  as  follows: 


1 .  Construct  a  square  of  order  run  using  m2  B's.  Let 
B,j  be  the  square  B. whose  position  is  i  -f- 1  from  top 
and  j  +  1  from  left. 


A 

2.  Add  u  x  n,  j  to  aU  members  of  Bj, . 

1.  * 

An  example  is  shown  below  for  the  case  m  —  n  =  2*. 
» 


.1  =  B  = 

v 
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3  2  10 
10  3  2 
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From  the  construction,  it  is  clear  that  no  symbol  appears 
more  than  once  in  any  row  or  in  any  column.  Suppose 
C  is  not  orthogonal  to  its  transpose.  There  should  exist 
i,j,k  and  /,  f  ■/-  /„■  or  j  ■£  l,  such  that  c, ,,  =  c*,i  and 
<b.i  —  cj.fc.  Since  0  <  <  in  —  1,  0  <  p,q  <  wt  -  1,  and 


0  5  br.t  <  n  -  1 ,  0  <  r,  .<  <  n  -  1 ,  from  e,±J 

=.  ci,j,  we  have 

(1) 

W 

From  c;  i  -  <•/.«,.  we  have 

"b/'d.t'/oi  '■  I'.H/riJ 

(3) 

b(  1111.(1.1,1111011.1  b  1 a ■  i\ n ,( in..  Jr. 

(•1) 
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(I)  (I)  .  ..nliadi.  t  I  In*  assumption  dial  .1  .md  II  are  doubly 
self  ii|tlio">,.n.d  I  .lit  itt  squares.  Ilejice.  (‘  is  oitliugouul  In 
its  transpose.  I’lii-  sai in-  aigimieut  can  In-  applied  in  ('  ami 
ils  aillilianr.|>iiM-.  Therefore.  ('  is*#  doubly  self-ni thogonal 
Latin  sqtiaie.  d 

Now.  ivc-  arc  ready  for  tlu-  following  lemma. 

Lctnmn  .1.2  For  till  k  e  {2, 2,-1,. . .},  there  crisis  a  doubly 
silf.ortboyoiiul  Latin  si/uurc  l)k  of  order  2*. 

Proof:  We  use  induction  on  k. 

liases:  F»»r  k  2  and  k  —  2,  the  following  squares  are 
doubly  self-orthogonal  Latin  squares  of  order  2*  and  23.  1) 3 

3  0  1  7  6  5  4' 

1  2  3  5  4  7  6 

7  4  5  3  2  1  0 

5  6  7  1  0  3  2 

2  1  0  6  7  4  5 

0  3  2  4  5  6  7 

6  5  4  2  3  0  1 

4  7  6  0  1  2  3 

Hypothesis:  There  exists  a  doubly  self-orthogonal  Latin 
square  Dk  of  order  2k. 

Induction  Step:  We  construct  a  doubly  self-orthogonal 
Latin  square  Dk+i  of  order  2k+J,  using  D 3  and  Dk,  whose 
existence  is  proven  in  the  previous  lemma.  □ 

Notice  that  D 2  in  the  above  lemma  is  also  a  perfect 
Latin  square.  Now,  we  present  the  main  theorem  for  the 
construction  of  perfect  Latin  squares. 

Theorem  3.1  For  all  k  S  {1,2,3,...},  there  exists  a  per¬ 
fect  Latin  square  P2k  of  order  2  ,  which  is  also  o  doubly 
sclf-orthoyonal  Latin  square. 

Proof:  When  k  =  1,  P2  =  D2.  When  k  >  2,  from  the 
doubly  self-orthogonal  Latin  square  Dk  of  order  2k,  we  con- 
s-.i  ct  a  Latin  square  C'2k  of  order  2ik  such  that 


can  be  found  in  [Ij. 


0  12  3 

2  3  0  1 

3  2  10 
10  3  2 


D3  - 


c;k,  »  2‘  x  4-  Ll!.,--  0  <  i.j  <22k-\ 


Notice  that  the  construction  method  for  C2k  is  the  same 
as  the  construction  method  appeared  in  lemma  3.1.  From 
C'k,  we  construct  perfect  Latin  square  P2k  of  oreder  22k 
such  that 

2k  ,2k 

Pl.j  "  <-'5‘/irrod2‘+|i/S,|j- 

Notice  that  P2k  is  obtained  from  C'2k  by  a  row  exchange 
operation  such  that  the  lust  k  most  significant  bits  of  i  are 
swapped  with  the  k  least  significant  bits  of  i,  i.e..  the  rows 
are  shuffled  k  times.  As  an  example,  we  build  a  perfect 
Latin  square  of  order  2'.  .Since  the  square  A  in  lemma  3.1 
is  the  doubly  self-orthogonal  Latin  square  I)2,  the  matrix 
('  in  lemma  3.1  is  exactly  the  intermediate  matrix  ("  we 
want .  We  get  a  perfect  Latin  square  P'  by  t  he  row  exchange 

operation  on  ( 


a  t  :•  :t 
>  'I  III  u 

I  .’  I  t  I  t  I  '. 

•  i«  7 

.t  0  I 

10  II  S  <1 

It  l.'i  If  1.1 

0  7  -I  It 

It  2  I  0 

II  IV  !l  8 

15  II  III  12 

7  0  It  -I 

I  ‘  ~  II  3  “  2 

9  8  It  10 

1.1  12  15  U 

5  -I  7  G 


t  5  ll  7 

12  l.t  It  15 

8  9  III  II 

(I  I  2  It 

li  7  •!  5 

1-1  II.  12  lit 

HI  II  8  9 

2  .2  II  I 

T  c.  i 
15  It  III  12 

II  III  9  8 

3  _2  1_  0 

5  *t  7  ft 

13  12  15  l-l 

9  8  10  II 

10  3  2 


8  9  III  II 

(I  I  2  3 

■I  5  0  7 

If  13  It  II. 
Ill  II  S  9 

2  3  II  I 

11  7  I  /. 

II  15  12  13 
It  III  '  9  8 

3  2  10 

7  f.  5  I 

I5_  1-1 13  12 

9  *S  n  To 

10  3  2 
5  -I  7  G 

13  12  15  1-t 


12  13  M  II. 

•I  I.  11  7 

(I  I  2  3 

8  9  III  II 

|.|  II.  12  13 

li  7  I  5 

2  3  0  I 

HI  II  8  9 

15  I  I  13  12  "| 

7  ft  3  -I 

3  2  10 

II  JO  _9  8 

13  12”  II.  if 

5  -t  7  f. 
10  3  2 

9  8  II  10 


/>• 


From  tiie  construction,  it  is  clear  that,  in  the  square  P2k,  no 
symbol  appears  more  than  once  in  any  row,  in  any  column 
or  in  any  main  subsquarc.  Suppose  P2k  is  not  orthogonal 
to  its  transpose.  There  should*  be  i,  j,  m  and  n,  i  qb  m 
or  j  £  11,  such  that  p2kj  =-p2k'n  and  p}k  =  p2km.  From  the 
construetion,  p2k .  0  <  i,j  <  22k  -  1  can  be  expressed  as 


Pu>  =  *  <tfim>d!*.|j/!‘|  +  ^f./2‘].Jn.od:‘ 

Since  0  <  dl  <  2k  -  1,  0  <  t,,  <  2k  -  1,  from  p2) 
we  have  . 

=  PH.  „> 

^(i/2‘).jmo<12‘  T  ^[m/2‘).nmod}‘ 

(5) 

^imod2fc.b/2*]  ^n,n.od3fc,[n/2*| 

(6) 

From  p2k  -  p2km,  wc  have 

d\]/7k\smtni7K  ^[n/2*).tnmod2fc 

(7) 

d)  m,,dj‘.|i/2‘)  =  ^modJ‘,|m/2‘) 

(8) 

(5)-(8)  contradict  the  assumption  that  Dk  is  a  doubly  self- 
orthogonal  Latin  square.  Hence  P2k  is  orthogonal  to  its 
transpose.  The  same  argument  can  be  applied  to  P2k  and 
its  antitranspose.  Hence,  P2k  is  a  doubly  self-orthogonal 
Latin  square  and  no  symbol  appears  more  than  once  in  any 
of  its  two  main  diagonals.  Hence,  P2k  is  a  perfect  Latin 
square  of  order  22k.  a 

4  Parallel  Access  to  Image  Array 

In  this  section,  we  show  in  detail  how  to  use  perfect  Latin 
squares  in  storing  and  retrieving  image  data.  Wc  also  show 
the  properties  of  perfect  Latin  squares  built  from  our  con¬ 
struction  method  which  arc  useful  for  image  data  storage. 

Suppose  wc  have  two  dimensional  image  data  /(A/,;V), 
where  M  and  N  are  multiples  of  2:t  and  wc  also  have  22k 
memory  modules.  Then,  we  can  use  the  perfect  Latin 
square  P2k  as  the  skewing  scheme.  The  image  element 
l  [i.j )  will  be  stored  in  the  memory  module 
Figuie  I  shows  an  example  for  the  case  M  ■  1’2, iV  --  16 
and  k  1 ,  i.e.,  we  are  using  -1  memory  modules  to  store 
image  data  of  size  12  y  16. 


657 


*  ' 


■  (1 

i 

;t 

2 

0 

i 

:t 

■» 

ti 

1 

•t 

•» 

ti 

1 

:t 

•> 

2 

:t 

l 

(1 

■i 

:t 

t 

n 

2 

:t 

1 

ti 

*» 

:t 

1 

(1 

1 

n 

•> 

:t 

i 

ti 

2 

:t 

1 

ti 

2 

:s 

1 

11 

*» 

:t 

:t 

2 

ti 

i 

;t 

2 

(1 

i 

:t 

2 

ti 

1 

-1 

ti 

1 

ti 

1 

:t 

2 

"ti 

T 

:t  ’ 

”2 

ti 

T 

:>r 

2’ 

Ti 

T" 

;t 

2 

2 

:t 

t 

tl 

2 

:t 

i 

ti 

2 

:t 

1 

0 

•» 

•» 

1 

0 

1 

ti 

2 

:t 

1 

0 

2 

:t 

1 

ti 

■1 

:i 

1 

ti 

2 

;t 

;t 

2 

11 

i 

;t 

2 

0 

1 

:t 

•i 

(1 

1 

;t 

•1 

ti 

1 

0 

1 

~:t 

2 

ti 

T 

;t 

■> 

11 

1 

*3 

•> 

“<r 

~T~ 

3 

2 

2 

:t 

i 

0 

2 

:t 

l 

ti 

2 

:< 

1 

ti 

2 

:t 

l 

0 

l 

0 

2 

:t 

1 

0 

2 

;t 

1 

ti 

2 

:t 

1 

0 

2 

3 

2 

0 

t 

:t 

2 

0 

l 

:t 

2 

0 

1 

3 

2 

(1 

1 

Figure  1:  Memory  assignment  of  12  x  Hi  array  into  -1  mem¬ 
ory  modules. 


In  figure  1,  it  is  easy  to  sec  that  every  1  x  4  rows  and 
every  4x1  columns  are  accessible  without  memory  con¬ 
flict.  2x2  main  subarrays  are  also  accessible  without 
memory  conflict.  Data  on  the  diagonals  of  4  x  4  perfect 
Latin  squares  are  also  accessible  .without  memory  conflict. 
These  properties  come  directly  from  the  definition  of  per¬ 
fect  Latin  squares.  However,  figure  1  docs  not  fully  demon¬ 
strate  the  properties  cf  the  perfect  Latin  squares  built  from 
our  construction  method,  because  we  are  using,  as  the  skew¬ 
ing  scheme,  P 1  which  is  not  built  from  our  method.  The 
perfect  Latin  squares  built  from  our  construction  method 
( Pik,k  >  2)  have  other  properties  which  arc  very  useful  in 
image  processing.  It  is  easy  to  verify  the  following  property 
satisfied  by  perfect  Latin  square  of  order  2lk,  k  >  2,  built 
from  theorem  3.1. 

Lemma  4.1  No  symbol  appears  more  lha n  once  in  a  sub- 
square  S,,j  such  that  i  =  0  mod  2*r  or  j  =  0  mod  2*. 

The  above  lemma  shows  that  not  only  the  main  sub¬ 
squares  are  accessible  without  memory  conflict  but  also  ev¬ 
ery  subsquares  on  the  vertical  or  horizontal  st rips  of  width 
2*  are  accessible  without  memory  conflict. 

It  is  easy  to  see  that  there  does  not  exist  a  Latin  square 
such  that  no  symbol  appears  more  than  once  in  any  sub¬ 
square,  which  means  there  is  no  skewing  scheme  which 
guarantees  conflict  free  access  to  rows,  columns  and  N't*  x 
N't1  subarrays  when  N  memory  modules  arc  used  to  store 


an  ,Y  .  „Y  array.  However,  we  have  I  lie  following  lemma 
which  assuies  that  the  maximum  degiee  of  memory  con 
Diet  is  two  when  an  arbitrary  sub->i|iiaie  is  accessed, 

\ 

Leliiiim  4.2  Vo  symbol  oppuns  mint  /hint  hco  Inins  in 
tiny  subs  ilium  SKJ, 

Lemma  4.2  is  a  direct  lesnlt  of  lemma  4.1  because  every 
suhsqiiure  can  be  partitioned  into  two  parts  such  that  each 
of  them  belong  to  a  subsqtiare  on  a  strip  of  width  2*. 

Another  subset  of  possible  interest  in  image  processing 
is  the  subset  of  elements  whose  coordinates  are  the  same 
within  .each  main  subsquarc.  This  subset  can  be  used  in 
algorithms  which  need  a  value  within  each  main  subarray. 
Let  a  same  position  S P,tJ  of  a  perfect  Latin  square  .4  of 
order  n1  be  defined  as  follows, 

SPU)  —  {n*.i  |  k  =  i  mod  n,l  ~  j  mod  ft} ,  0  <  i.j  <n~  1. 

f 

Now,  we  have  t  lit  following  lemma  which  shows  no  memory 
conflict  occurs  when  a  same  position  is  accessed. 

Lemma  4.3  No  symbol  appears  more  than  once  in  any 
same  position  $ Pit). 

\ 

5  Conclusion 

A  new  efficient  memory  sysiem  for  image  processing  and 
vision  is  described  in  this  paper.  Latin  squares  have  been 
adapted  as  the  skewing  schemfr.  ‘Be r feet  Latin  squares  which 
arc  Latin  squares  with  special  properties  useful  for  image 
processing  have  been  introduced.  A  simple  construction 
method  for  perfect  Latin  squares  is  also  shown.  Using  per¬ 
fect  Latin  squares  as  the  skewing  scheme,  many  interesting 
subscls(rows,  columns,  diagonals,  main  subarrays  etc.)  of 
image  data  can  be  accessed  without  memory  conflict. 

In  this  memory  system,  it  is  possible  to  show  that  the 
address  generation  is  very  easy  for  both  memory  module 
address  and  local  address  within  memory  modules.  Mem¬ 
ory  module  address  can  be  generated  in  constant  time  with 
a  simple  logic  circuitry  and  local  address  generation  does 
not  need  any  additional  hardware. 

The  efficient  skewing  scheme  along  with  the  fast  address 
generation  circuitry  makes  it  po,  ble  that  rows,  columns, 
diagonals,  same  positions,  and  subarrays  can  be  accessed 
in  constant  lime.  This  memory  system  achieves  this  goal 
using  minimum  number  of  memory  modules. 

Table  1  shows  several  known  memory  schemes  along 
with  the  scheme  proposed  in  this  paper.  Comparisons  are 
based  on  storing  N  x  N  data.  An  address  generation  is 
considered  to  be  hard  if  it  requires  modulo  operations  of  a 
number  which  is  not  a  power  of  two. 
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Table  1:  Comparison  of  memory  systems  (Blocks  include 
square  blocks  and  distributed  blocks.  Square  blocks  are 
same  as  main  subsquares  and  distributed  blocks  are  same 
as  same  positions  discussed  in  this  paper.) 
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Abstract 

In  this  paper,  we  consider  o  set  of  array  proces¬ 
sors  with  optical  interconnection  networks  for  fine 
grain  image  processing.  The  unit  time  interconnec¬ 
tion  medium  made  possible  with  the  use  of  free  space 
optics,  results  in  efficient  solutions  to  communication 
intensive  image  computations.  Using  these  models, 
we  present  a  summary  of  our  results  in  designing 
Oflog  N)  pointer  based  algorithms  for  several  tasks  in 
low  and  medium  level  vision.  We  also  show  a  generic 
subroutine  that  can  be  used  to  simulate  PRAM  al¬ 
gorithms  for  image  computations  on  the  proposed 
electro-optical  arrays. 

1  Introduction 

During  past  decade,  several  mesh  based  VLSI  archi¬ 
tectures  have  been  considered  for  fine  grain  image 
computations  [13,  12,  1,  14],  The  underlying  inter¬ 
connection  medium  of  these  designs  has  a  significant 
impact  on  the  performance  of  the  desired  parallel  al¬ 
gorithms.  For  example,  solving  many  image  process¬ 
ing  problems  on  a  N  x  N  array  of  processors  takes 
as  much  as  O(N)  time  (12).  This  can  be  reduced  to 
a  logarithmic  time  complexity  for  a  limited  class  of 
problems  when  organizations  such  as  pyramids,  mesh 
of  trees,  or  reconfigurable  ineshes  is  used  (13,  14,  11). 

The  Parallel  Random  Access  Machine  (PRAM) 
is  an  abstract  shared  memory  model  which  has  been 
employed  in  designing  parallel  algorithms  for  compu¬ 
tational  geometry  and  computer  vision,  among  oth¬ 
ers.  The  major  drawback  of  this  model  is  in  its  un¬ 
realistic  unit  time  communication  assumption.  The 
unit  time  simulation  of  a  A  processor  PRAM  using 

*Thi*  research  was  supported  in  part  l>y  the  National  Sci¬ 
ence  Foundation  under  grant  IIU-871'J8:it>  and  by  AKOSR  un¬ 
der  grant  AFOSR-89-0032 


electronic  interconnects  is  not  possible  unless  one  al¬ 
lows  unbounded  fan-in  and  or  fan-out  for  each  of  the 
processors.  In  fact,  a  lower  bound  on  the  time  to 
simulate  one  step  of  a  PRAM  on  any  bounded  de¬ 
gree  network  of  N  nodes  using  electrical  intercon¬ 
nects  is  O(iogTV).  Even  though  several  networks  have 
been  designedLto  efficiently  simulate  PRAM  [7],  many 
cost/performance  issues  limit  their  potential  applica¬ 
tion  in  high  performance  parallel  systems. 

Recently,  there  has  been  an  emerging  interest  in 
using  optical  communications  in  parallel  processing. 
The  replacement  of  tlie  electrical  interconnects  with 
optical  beams  has  a  significant  impact  on  the  per¬ 
formance  of  parallel  computing  systems  implemented 
using  VLSI  technology  [4,‘  10].  This  is  due  to  the  fol¬ 
lowing  two  important  properties  of  free  space  optics. 
First,  free  space  optical  ‘.beagis  can  cross  each  other 
without  any  interference.  Also,  the  connections  need 
not  be  fixed  and  can  be  redirected  [3],  This  implies 
that  using  optical  interconnects,  one  can  design  area 
efficient  bounded  degree  VLSI  architectures  that  can 
simulate  a  unit  delay  interconnection  network. 

In  this  paper,  we  consider  a  class  of  parallel  archi¬ 
tectures  which  use  free  space  optical  beams  as  means 
of  interprocessor  communications.  The  organization 
of  these  networks  is  based  on  i.  proposed  generic  par¬ 
allel  model  of  computation.  This  model  called  OMC, 
allows  unit  delay  interconnects  and  can  efficiently 
simulate  PRAM  algorithms.  YVe  introduce  possible 
physical  realizations  of  this  model.  Each  of  the  pro¬ 
posed  designs  reflects  the  capabilities  and  limitations 
of  the  device  technology  used  for  the  redirection  of 
optical  beams.  Using  these  models,  we  present  effi¬ 
cient  pointer  based  algorithms  for  finding  geometric 
properties  of  digitized  images,  and  parallel  implemen¬ 
tation  of  iterative  methods  used  in  high  level  vision 
processing.  YVe  will  also  present  a  simple  O(logAf) 
algorithm  for  simulating  each  step  of  an  N  proces¬ 
sor  PRAM  on  these  models  using  0(At/logA)  pro- 
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cessors.  This  chii  be  used  ns  a  genetic  subroutine 
in  translating  PRAM  algorithms  for  image  computa¬ 
tions.  ^ 

Tlie  rest  of  the  paper  is  organized  os  follows. 
In  the  next  section,  we  discuss  an  optical  model  of 
computation,  and  present  three  physical  implemen¬ 
tations  of  the  model.  In  section  3,  we  study  a  generic 
subroutine  for  simulating  PRAM  algorithms.  A  set 
of  parallel  algorithms  for  image  computations  using 
electro-optical  arrays  is  shown  in  section  4. 

2  Electro  Optical  Arrays 

In  the  first  part  of  this  section,  we  define  an  Optical 
Model  of  Computation  (OMC)  which  is  an  abstrac¬ 
tion  of  currently  implementable  optical  and  electro- 
optical  computers.  Similar  to  the  VLSI  model  of 
computation  [18],  this  generic  model  can  be  used  to 
understand  the  limits  on  computational  efficiency  in 
using  optical  technology.  In  the  second  part  of  this 
section,  we  present  possible  physical  realizations  of 
the  optical  interconnection  network  defined  by  OMC. 
Each  of  the  proposed  designs  reflects  the  capabilities 
and  limitations  of  the  device  technology  used  for  the 
redirection  of  optical  beams. 

2.1  A  Generic  Model 

OMC  is  defined  as  follows: 

Definition  1  An  optical  model  of  computation  repre¬ 
sents  a  network  of  N  processors  each  associated  with 
a  memory  module,  and  a  deflecting  unit  capable  of  es¬ 
tablishing  direct  optical  connection  to  another  proces¬ 
sor.  The  interprocessor  communication  is  performed 
satisfying  the  following  rules  similar  to  [2j: 

1.  At  any  time  a  processor  can  send  at  most  one 
message.  Its  destination  is  another  processor. 

2.  The  message  will  succeed  in  reaching  the  proces¬ 
sor  if  it  is  the  only  message  with  that  processor 
as  its  destination,  at  that  time  step. 

3.  All  messages  succeed  or  fail  (and  thus  are  dis¬ 
carded)  in  unit  time. 

To  insure  that  every  processor  knows  when  its 
message  succeeds  we  assume  that  the  OMC  is  run 
in  two  phases.  In  the  first  phase,  read/write  mes¬ 
sages  arc  sent,  and  in  the  second,  values  are  returned 
to  successful  readers  and  acknowledgements  are  re¬ 
turned  to  successful  writers.  We  assume  that  the  op¬ 
eration  mode  is  synchronous,  and  all  processors  are 


connected  to  a  central  control  unit.  The  above  defi¬ 
nition  is  supplemented  with  a  set  of  assumptions  for 
an  accurate  analysis  in  (Gj. 

Using  those  assumptions,  the  following  space 
time  trade-off  can  be  shown: 

AT  =  n(/) 

where  A  is  the  area  occupied  by  the  processing  layer, 
and  the  time  (T)  required  to  solve  a  problem,  is  the 
number  of  cycles  necessary  to  exchange  the  minimum 
required  information  (/).  When  compared  with  three 
dimensional  VLSI  model  of  computation  the  following 
proposition  can  be  stated: 

Proposition  1  Any  computation  performed  by  a 
three  dimensional  VLSI  organization  having  N  pro¬ 
cessors  with  degree  d,  '.in  'time  T,  and  volume  V  can 
be  performed  on  OMC  in  volume  v,  and  time  t,  where 
dT/N  <t<T,  and  Nd  <  v. 

In  the  following  section  three  different  parallel 
architectures  are  presented  as  possible  efficient  upper 
bounds  for  v. 

2.2  Physical  realizations 

In  this  section,  we  present  jo,  class  of  optical  intercon¬ 
nection  networks  as  a  realization  of  the  OMC  pre¬ 
sented  in  the  previous  section.  Each  of  the  proposed 
designs  uses  a  different  optical  device  technology  for 
redirection  of  the  optical  beams  to  establish  a  new 
topology  at  any  clock  cycle,  and  represents  an  upper 
bound  on  the  volume  requirement  of  OMC. 

2.2.1  Optical  Mesh  Using  Mirrors 

In  this  design,  there  are  N  processors  on  the  process- 
;ng  layer  of  area  N.  Similarly,  the  deflecting  layer  has 
area  N  and  holds  N  mirrors.  These  layers  are  aligned 
so  that  each  of  the  mirrors  is  located  directly  above  its 
associated  processor.  Each  processor  has  two  lasers. 
One  of  these  is  directed  up  towards  the  arithmetic 
unit  of  the  mirror  and  the  other  is  directed  towards 
the  mirror’s  surface.  A  connection  phase  would  con¬ 
sist  of  two  cycles.  In  the  first  cycle,  each  processor 
sends  the  address  of  its  desired  destination  processor 
to  the  arithmetic  unit  of  its  associated  mirror  using 
its  dedicated  laser.  The  arithmetic  unit  of  the  mirror 
computes  a  rotation  degree  such  that  both  the  ori¬ 
gin  and  destination  processors  have  equal  angle  with 
the  line  perpendicular  to  the  surface  of  the  mirror  in 
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the  plane  formed  by  the  mirror,  the  source  proces¬ 
sor,  and  the  destination  processor.  Once  the  angle  is 
computed,  the  mirror  is  rotated  to  point  to  the.  de¬ 
sired  destination.  In  the  second  cycle,  connection  is 
established  by  the  laser  beam  carrying  the  data  from 
the  source  to  the  mirror  und  from  the  mirror  being 
reflected  towards  the  destination.  Since  the  connec¬ 
tion  is  done  through  a  mechanical  movement  of  the 
mirror,  with  the  current  technology  this  leads  to  an 
order  of  milli-seconds  reconfiguration  time.  Therefore 
this  architecture  is  suitable  for  applications  where  the 
interconnection  topology  does  not  have  to  be  changed 
frequently. 

The  space  requirement  of  this  architecture  is 
O(N)  under  the  following  assumption.  Each  mirror  is 
attached  to  a  simple  electro-mechanical  device  which 
takes  one  unit  of  space  and  can  rotate  to  any  posi¬ 
tion  in  one  unit  of  time.  With  current  technological 
advancements,  the  above  assumption  may  not  be  the 
most  accurate  one  but  we  do  not  see  any  technological 
limitations  preventing  us  from  making  our  assump¬ 
tions.  In  fact,  our  assumptions  are  as  valid  as  those  in 
VLSI;  the  constant  propagation  assumption  re¬ 
gardless  of  wire's  length.  Other  assumptions  can  also 
be  made  based  on  the  following  arguments.  Many 
mirrors  have  a  reconfiguration  delay  proportional  to 
their  rotation  angle,  O(N).  More  complex  mirrors  on 
the  other  hand,  can  rotate  faster  for  a  larger  angle 
(unit  time  rotation  delay  )  but  their  size  can  grow 
proportional  to  the  number  of  angles  they  can  realize 
(  0(N)  ). 

2.2.2  Electro-optical  Linear  Array 

In  this  organization,  N  processors  are  arranged  to 
form  a  one-dimensional  processing  layer  and  the  cor¬ 
responding  acousto-optics  devices  are  similarly  lo¬ 
cated  on  a  one-dimensional  deflecting  layer.  The  size 
of  each  of  the  acousto-optic  devices  is  proportional  to 
the  size  of  the  processing  array,  leading  to  an  0(N2) 
area  deflection  layer.  Similar  to  the  design  using  the 
mirrors,  every  processor  has  two  lasers,  and  each  con¬ 
nection  phase  is  made  of  two  cycles.  In  the  first  cy¬ 
cle,  each  processor  sends  the  address  of  its  desired 
destination  processor  to  the  arithmetic  unit  of  its  as¬ 
sociated  acousto-optic  unit  using  its  dedicated  laser 
beam.  The  acousto-optic  cell's  arithmetic  unit  com¬ 
putes  the  frequency  of  the  wave  to  be  applied  to  the 
crystal  for  redirection  of  the  incoming  optical  beam  to 


the  destination  processor.  The  acousto-optic  device 
then  redirects  the  incident  beam  from  the  source  to 
the  destination  processor.  One  of  the  advantages  of 
this  architecture  over  the  previous  design  is  its  order 
of  micro-seconds  reconfiguration  time,  which  is  dom¬ 
inated  by  the  speed  of  sound  waves.  The  other  ad¬ 
vantage  is  its  broadcasting  capability,  which  is  due  to 
the -possibility  of  generating  multiple  waves  through 
a  crystal  at  a  given  time. 


2.2.3  Electro  Optical  Crossbar 

This  design  uses  a  hybrid  reconfiguration  technique 
for  interconnecting  processors.  There  are  N  proces¬ 
sors  each  located  in  a  distinct  row  and  column  of  the 
N  x  N  processing  layer.  For  each  processor,  there 
is  a  hologram  mqdule  having  N  units,  such  that  the 
ith  unit  has  a  grating  plate  with  a  frequency  leading 
to  a  deflection  angle  corresponding  to  the  processor 
located  at  the  grid  poiqt  (i,i).  In  addition,  each  unit 
has  a  simple  controller,  and  a  laser  beam.  To  es¬ 
tablish  or  reconfigure  to  a  new  connection  pattern, 
each  processor  broadcasts. the  address  of  the  desired 
destination  processor  to  tire  controller  of  each  of  N 
units  of  its  hologram  mpdule  using  an  electrical  bus. 
The  controller  activates  a-i&sfcr  (for  conversion  of  the 
electrical  input  to  optical  signal),  if  its  ID  matches 
the  broadcast  address  of  the  destination  processor. 
The  connection  is  made  when  the  laser  beams  are 
passed  through  the  predefined  gratings.  Therefore, 
since  the  grating  angles  are  predefined,  the  reconfig¬ 
uration  time  of  this  design  is  bounded  by  the  laser 
switching  time  which  is  in  the  order  of  nano-seconds 
using  Gallium  Arsenide  (GaAs)  technology  [9]. 

This  architecture  is  faster  than  the  previous  de¬ 
signs  and  further  it  compares  well  with  the  clock  cycle 
of  the  current  supercomputers.  One  of  the  advan¬ 
tages  of  this  simple  design  is  in  its  implementability 
in  VLSI,  using  GaAs  technology.  Unlike  the  previous 
designs,  this  can  be  fabricated  with  very  low  cost  and 
is  highly  suitable  for  applications  where  full  connec¬ 
tivity  is  required.  In  such  applications,  the  processor 
layer  area  can  be  fully  utilized  by  placing  N  optical 
beam  receivers  in  each  of  the  vacant  areas  to  simul¬ 
taneously  interconnect  with  all  the  other  processors. 
This  design  can  be  easily  adopted  to  implement  a  neu¬ 
ral  network  of  processes  with  optical  interconnects  (G). 


3  Mapping  and  Simulation  of 
Parallel  Algorithms 

An  OMC  with  N  processors  can  sitnulnle,  in  real 
time,  nil  Exclusive  Read  Exclusive  Write  (EREW) 
PRAM  iiaving  P  processors  and  Af  memory  loca¬ 
tions,  where  N=  maximum  {P,  Af}.  On  the  other 
hand,  a  P  processor  EREW  PRAM  can  simulate  in 
real  time  the  computations  of  &  P  processor  OMC. 
Thus,  the  interesting  cases  are  when  the  memory  is 
“large”.  In  this  section,  we  present  a  simple  efficient 
algorithm  for  simulation  of  a  N  processor  EREW 
PRAM  using  an  OMC  with  Nf  log  N  processors.  This 
algorithm  can  be  used  as  a  subroutine  for  simulating 
and  mapping  PRAM  algorithms  for  image  processing 
onto  OMC. 

The  input  is  assumed  to  be  of  the  following 
form.  Every  processor  is  responsible  for  routing  the 
O(logAf)  messages  that  reside  in  its  local  memory. 
Each  message  has  a  destination  tag  attached  to  it. 
The  destination  address  is  made  up  of  two  compo¬ 
nents  x,  y,  where  x  denotes  the  address  of  the  mem¬ 
ory  module,  x  <  Nf  log  AT ,  and  y  denotes  its  position 
within  the  module,  y  <  log  AT. 

In  the  first  step  of  the  algorithm,  all  elements  in 
each  PE  are  sorted  based  upon  the  position  to  which 
they  will  write  within  the  memory  module,  with  du¬ 
plicate  positions  being  sent  to  a  second  queue.  Next, 
each  of  the  elements  is  sent  out  one  by  one.  After  this, 
the  duplicates  from  the  last  iteration  are  brought  for¬ 
ward  and  the  process  is  repeated.  This  algorithm 
works  in  0(log2  N)  time  by  running  through  each 
O(logW)  time  iteration  a  total  of  log  N  times,  to  in¬ 
sure  that  all  elements  are  transferred  to  the  correct 
memory  locations.  This  leads  to: 

Lemma  1  One  step  of  an  N  processor  EREW 
PRAM  can  be  simulated  on  an  OMC  with  Nf  log  N 
processors  in  ©(log2  N)  time.  Further,  there  exists 
an  input  sequence  for  which  this  is  the  best  possible 
bound. 

The  0(log,V)  stages  of  a  slightly  modified  ver¬ 
sion  of  the  algorithm  in  the  proof  of  Lemma  1  [6], 
leads  to  a  randomized  algorithm  with  a  running  time 
of  0(logW  log  log  N)  with  high  probability,  and  a 
worst  case  time  complexity  of  0(logJ  N).  Our  rout¬ 
ing  algorithm  differs  from  others  in  the  literature  in 
the  way  randomization  is  used.  Unlike  the  algorithms 
of  (20)  and  (19),  it  does  not  randomize  with  respect 
to  paths  taken  by  messages.  For  example,  Valiant’s 
classic  scheme  for  routing  on  a  hypercube  sends  each 


message  to  a  randomly  chosen  intermediate  destina¬ 
tion  i  nd,  from  there,  to  its  true  destination.  On 
OMC  such  a  technique  would  lead  to  queues  of  size 
0(log*  N).  In  (8),  instead  of  choosing  random  paths 
for  messages  to  traverse,  their  algorithm  repeatedly 
attempts  to  deliver  a  randomly  chosen  subset  of  the 
messages.  A  by-product  of  their  strategy  is  that  their 
algorithm  requires  no  intermediate  buffering  of  mes¬ 
sages,  and  hence  works  under  the  general  situation 
where  each  processor  can  send  and  receive  polynomi¬ 
al^  many  messages.  In  (15),  the  only  use  of  random¬ 
ization  is  in  selecting  a  hash  function  to  distribute  the 
shared  dddress  space  of  the  PRAM  onto  the  nodes 
of  the  butterfly.  (Note  that  the  direct  application 
of  the  techniques  of  (15)  to  our  problem  will  lead  to 
0(logJ  N)  running  time,  and  0(iogJ  N)  local  mem¬ 
ory  requirement  for  each  processor.)  Similarly,  we 
only  use  randomization  jn  assuming  a  random  distri¬ 
bution  for  input  data.  This  random  distribution  can 
also  be  obtained  in  selecting  a  hash  function  to  dis¬ 
tribute  the  shared  address  space  of  the  PRAM  onto 
an  OMC  with  Nf  log  N  processors,  each  having  log  N 
local  memory.  The  routing  itself  is  deterministic  as 
explained  in  the  proof  of  Lemma  1,  and  has  worst 
case  running  time  of  0(logJ  N). 

Theorem  1  The  probability  that  the  simulation  of 
one  step  of  an  N  processor^ERBW  PRAM  using  an 
OMC  with  Nf  log  N  processors -would  take  more  than 

0(log  N  loglog  N)  time  is  given  by  (3e~  logl  ,  for 
some  constant  /3  independent  of  N. 

In  the  above,  as  opposed  to  the  original  algo¬ 
rithm,  we  check  each  of  the  iteration  for  the  number 
of  elements  that  are  left  to  be  sent.  If  there  are  none, 
the  algorithm  exits  else  the  program  will  run  again, 
but  will  only  iterate  according  to  the  number  of  ele¬ 
ments  left  to  be  sent.  This  leads  to  0(log  N  log  log  N ) 
running  time  with  a  high  probability.  For  more  de¬ 
tails  see  (6). 

4  Fine  Grain  Image  Comput¬ 
ing 

The  algorithms  described  in  the  last  section,  can  be 
used  as  a  subroutine  in  mapping  and  simulating  any 
PRAM  algorithm  for  image  processing  onto  the  pro¬ 
posed  models.  In  this  section,  we  present  a  summary 
of  our  results  in  designing  algorithms  directly  on  these 
models  for  fine  grain  image  computing.  For  full  de¬ 
tails  see  (6). 
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4.1  Low  and  Medium  Level  Vision 
Processing 

An  early  step  in  image  processing  is  identifying  figures 
in  the  image.  Figures  correspond  to  connected  l’s  in 
the  image.  An  TV  x  TV  digitized  picture  may  contain 
more  than  one  connected  region  of  black  pixels.  The 
problem  is  to  identify  to  which  figure  (label)  each  ”1” 
belongs  to. 

Lemma  2  Given  a  TV  x  TV  0/1  image,  all  figures  can 
be  labeled  in  O(logTV)  time  using  an  (TV/  log1/ 3  N  x 
N/ log1'*  N)-OMC. 

Convexity  plays  an  important  role  in  image  pro¬ 
cessing  and  in  vision;  many  other  problems  can  be 
solved  once  the  convex  hull  of  figures  is  obtained. 
It  has  applications  to  normalizing  pattern  in  image 
processing,  obtaining  triangulations  of  sets  of  points, 
topological  feature  extraction,  shape  decomposition 
in  pattern  recognition,  testing  for  linear  separability, 
etc.  [13]. 

Lemma  3  Given  a  TV  x  TV  0/1  image,  the  extreme 
points  of  all  the  figures  can  be  enumerated  in  0(log  TV) 
time  using  an  (TV/ log1/3  TV  x  TV/  log1/3  N)-OMC. 

Another  interesting  problem  is  to  identify  and  to 
compute  the  distance  to  a  nearest  figure  to  each  figure 
in  a  digitized  image.  In  the  following,  we  use  the  l\ 
metric.  However,  it  can  be  modified  to  operate  for 
any  Ik  metric. 

Lemma  4  Given  a  N  x  N  0/1  image,  the  nearest 
figure  to  all  figures  can  be  computed  in  0(log  TV)  time 
using  an  (TV/ log1/3  TV  x  TV/ log1/3  N)-OMC. 

Template  matching  is  another  basic  operation  in 
image  processing  and  computer  vision  [16,  17].  Tem¬ 
plate  matching  can  be  described  as  comparing  a  tem¬ 
plate  (window)  with  all  possible  windows  of  the  im¬ 
age.  Bach  position  of  the  image  will  store  the  result 
of  the  window  operation  for  which  it  is  the  top-left 
corner.  Note  that  the  computation  can  be  done  in 
0(N  x  fcJ)  time  using  a  uniprocessor.  Using  TV/  log  N 
processors,  the  above  computation  can  be  done  in 
0(JtJlogTV)  time. 

4.2  Parallel  Implementation  of  Itera¬ 
tive  methods  for  higher  level  vi¬ 
sion  processing 

Solutions  to  many  problems  in  image  understanding 
can  be  posed  in  terms  of  iterative  improvement  to  an 
initial  configuration.  For  example,  discrete  relaxation 


based  approaches  to  scene  labeling  can  be  viewed  as 
an  iterative  improvement  process.  In  such  problems, 
the  underlying  graph  is  usually  sparse,  lint  this  spar¬ 
sity  is  not  regular.  Bfli.'icnt  parallel  implementations 
of  such  relaxation  methods  is  possible  with  OMC. 

Any  iterative  matrix  structure  can  be  reulized 
by  OMC  using  devices  such  ns  holograms  (or  those 
described  previously).  Although  the  reconfiguration 
time  for  the  holograms  can  be  in  the  order  of  sec¬ 
onds  it  only  has  to  be  set  once  during  the  prepro¬ 
cessing  ph|se.  The  structure  of  the  coefficient  matrix 
is  used,  to  define  the  holographic  connections.  The 
interconnection  pattern  remains  the  same  through¬ 
out  the  computation.  An  optimal  O(logm)  time  can 
be  achieved  by  this  design' and  the  number  of  proces¬ 
sors  depends  only  on  the  number  of  non-zero  elements 
in  the  matrix  [6].  This  nfethod  is  attractive  when 
many  computations  are  to  be  performed  in  which  the 
structure  of  the  coefficient  matrix  is  fixed.  It  is  well 
suited  for  implementation  of  many  iterative  methods 
such  as  Gauss-Jordan,  Gauss-Siedel  and  the  Conju¬ 
gate  method  [5], 

Consider  the  iterative  method 

xk+x  =  Mxk+g 

\. 

where  M  is  sparse  and  nonsingular.  Let  n,-  be  the 
number  of  nonzero  elemenf.iq  the  tth  row  of  M,  and 
let  ...,  jni  be  the  columns'corresponding  to  these 
elements.  Thus,  the  above  equation  can  be  rewritten 
as 

*i+l  =  +9i- 

»=i 

Suppose  there  are  n  processors  and  each  of  the  pro¬ 
cessors  stores  exactly  one  nonzero  element  in  matrix 
M .  Then  the  following  can  be  shown: 

Lemma  5  The  OMC  can  be  used  to  solve  each  iter¬ 
ation  of  the  iterative  solution  to  any  general  sparse 
linear  systems  with  m  variables  and  n  nonzero  ele¬ 
ments  in  £?(log  m)  lime. 

5  Conclusion 

We  studied  a  set  of  electro-optical  arrays  for  effi¬ 
cient  image  computations.  Using  these,  we  showed 
O(logTV)  algorithms  for  determining  several  proper¬ 
ties  of  images.  Furthermore,  we  introduced  a  generic 
subroutine  that  can  be  used  to  map  and  simulate  the 
existing  PRAM  algorithms  on  these  models. 
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Abstract 

We  present  processor-time  optimal  parallel  algorithms  for  several  image  and  vision  prob¬ 
lems  on  a  novel  architecture  which  combines  an  orthogonally  accessed  memory  with  linear 
array  structure.  The  organization  has  p  processors  and  a  memory  of  size  0(n2)  locations. 
The  number  of  processors  p  can  vary  over  the  range  [1,  n3/2]  while  providing  optimal  speedup 
for  several  problems  in  image  processing  and  vision.  Such  problems  include  labeling  con¬ 
nected  components  and  computing  the  convex  hull,  diameter,  smallest  enclosing  rectf.pgip, 
and  nearest  neighbors  of  each  region.  Histogramming  and  computing  the  Hough  Transform 
are  also  considered.  Such  problems  arise  in  medium-level  vision  and  require  global  oper¬ 
ations  on  image  pixels.  For  these  problems,  it  is  shown  that  the  proposed  organization  is 
superior  to  the  mesh  and  pyramid  organizations. 
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1  Introduction 


Several  parallel  techniques  for  image  computations  have  been  implemented  on  distributed 
models  of  computation.  In  particular,  mcsh-conucctcd  computers  have  been  widely  used  for 
image  processing  since  the  nearest-neighbor  interconnection  among  the  processors  preserves 
the  spatial  relation  among  image  pixels  white  maintaining  a  low  hardware  implementation 
cost  [7,  11,  31].  Mesh-connected  arrays  can  efficiently  handle  low  level-image  processing 
applications  in  which  only  local  operations  are  performed  on  image  pixels  [6],  or  compu¬ 
tationally  intensive  tasks  (such  as  image  template  matching)  for  which^  suitable  mappings 
can  result  in  optimal  speedup  on  the  array.  However,  a  wide  class  of  problems  in  vision 


and  image  analysis  involves  the  computation  of  geometric  properties  of  image  regions.  Such 
problems  require  global  or  dense  data  movement  operations  on  the  image.  The  time  needed 
to  solve  such  problems  on  an  nx  n  mesh  is  lower  bounded  by  n  (proportional  to  the  diameter 
of  the  mesh)  even  if  the  problem  considered  is  not  computationally  intensive.  For  example, 
a  wide  class  of  problems  on  n  x  n  images  can  be  solved  sequentially  in  Q(n2)  time,  but  take 
0(n)  time  on  an  n  X  n  mesh  (i.e.  the  speedup  is  not  optimal).  The  mesh-diameter  problem 
can  be  alleviated  by  using  broadcast  buses  (24),  a  reconfigurable  bus  (l5),  or  a  hierarchy 
of  processors  and  buses  such  as  in  the  Pyramid  Computer  (PC)  [10,  19],  and  the  Mesh  of 
Trees  (MOT)  organization  [14,  22,  23].  However,  even  with  such  enhancements,  the  number 
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Furthermore,  algorithms  which  require  dense  data  movement  requirements  (such  as  image 
rotation  and  sorting  pixels)  can  not  be  performed  efficiently  on  these  organizations. 


In  this  paper,  we  present  a  mesh-connected  array  with  special  memory-access  and  com¬ 
munication  capabilities  to  provide  processor-time  optimal  algorithms  to  several  image  prob¬ 
lems  which  involve  global  computations  on  image  pixels.  The  problems  we  consider  include 
histogramming,  Hough  Transform,  component  labeling,  computing  convexity  and  related 
properties,  and  computing  nearest  neighbors.  Although  such  problems  are  not  compu¬ 
tationally  intensive  (except  for  computing  the  Hough  Transform,  they  all  can  be  solved 
sequentially  iu  0{n2)  time  for  annxn  image),  they  incur  a  high  communication  overhead 
when  implemented  on  a  parallel  distributed-memory  computer.  However,  for  such  prob¬ 
lems,  divide-and-conquer  and  data-rcduction  techniques  can  be  used  to  reduce  the  number 
of  computations  and  the  communication  overhead. 
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The  proposed  array  consists  of  a  memory  .array  of  size  0(n2)  locations  (onto  which  an 
n  X  u  image  can  he  mapped)  and  />  Processing  Elements  (PEs),  where  p  can  vary  over  the 
range  l  to  n2.  The  PEs  arc  interconnected  in  a  mesh  fashion  and  have  special  access  to  the 
memory.  The  organization  is  called  Mesh-Connected  Modules  (MCM)  because  it  can  he 
looked  upon  as  a  mesh  of  basic-modules,  where  each  basic-module  is  a  parallel  organization 

in  which  the  PEs  have  row  and  column  access  to  a  subarray  of  the  memory.  The  MCM 

« 

supports  global  data  operations  efficiently  and  is  thus,  highly  suited  to  the  class  of  image 
computations  considered  here.  One  major  distinguishing  feature  of  the  MCM  from  other 
mesh-based  parallel  organizations  is  that  communication  among  processors  is  done  via  a 
shared  memory  as  well  as  by  intcrproccssor  links.  Most  of  our  algorithms  take  0(n2/p)  time, 
using  p  processors,  where  p  is  in  the  range  [l,n3/2].  As  compared  to  the  performance  of  a 
pyramid  computer  for  such  problems  [19],  the  MCM  performance  is  clearly  superior.  For 

example,  the  problems  of  computing  the  connected  components  of  an  image  and  computing 

\ 

all  closest  figures  take  0(n1//2)  time,  while  computing  the  convexity  of  multiple  figures  takes 
0{nx!2  logn)  time,  on  a  pyramid  computer  with  0(n2)  PEs.  The  MCM  can  solve  all  these 
problems  in  0(n1/2)  time  using  n3/2  PEs  only. 

■r.- , . 
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The  rest  of  the  paper  is  organized  as  follows.  Section  2  introduces  the  proposed  MCM 
organization;  section  3  presents  a  summary  of  results  and  comparisons  to  other  organiza¬ 
tions;  and  section  4  presents  processor-time  optimal  parallel  algorithms  for  several  image 
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2  Architecture 


The  proposed  organization  is  based  on  a  class  of  orthogonal  memory-access  architectures 
which  have  been  studied  by  several  authors  [1,  5,  12,  27,  29].  A  VLSI-based  multiprocessor 
architecture  with  orthogonal  access  to  a  memory  array  was  first  proposed  by  Tseng,  Ilwang, 
and  Prasanna  Kumar  [29],  to  provide  optimal  speed  up  for  matrix-based  computations  and 
sorting.  Ja’Ja’  and  Owens  [13],  have  proposed  a  somewhat  similar  array  for  computing 
the  Fast  Fourier  Transform  and  sorting;  also,  Scherson  and  Sen  [27]  have  considered  the 
organization  for  sorting.  Finally,  Alnuweiri  and  Prasanna  Kumar  [1,  4,  5]  have  proposed 
efficient  data  movement  techniques  for  this  organization  which  lead  to  processor-time  opti¬ 
mal  solutions  to  a  wide  class  of  image  and  vision  problems  and  graph  theoretic  problems. 
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In  (l),  the  organization  has  been  called  a  deduced  Mesh  of  Trees  (RMOT)  because  it  can 
he  viewed  as  mesh  of  trees  organization  in  which  the  base  PEs  arc  replaced  by  memory 
modules,  and  each  row  (column)  tree  of  PEs  is  replaced  by  one  PE  with  a  row  (column) 
bus. 


2.1  A  Review  of  the  RMOT  Organization- 

An  RMOT  of  size  m  consists  of  m  PEs  each  having  row  and  column  access  to  an  m  X  rn 
array  of  memory  modules  such  that  PE i  can  access  the  modules  in  the  ith  row  and  the  ith 
column  of  the  array.  The  organization  of  an  RMOT  with  4  PEs  is  shown  in  figure  1,  where 
the  dashed  lines  represent  data-links  whose  purpose  will  be  explained  in  the  next  section. 
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Figure  1:  Organization  of  an  RMOT  (Basic  Module)  with  4  PEs 


The  PEs  have  basic  arithmetic/logic  capabilities  and  all  their  internal  registers  and  data 
paths  arc  0(logn)-bit  wide.  Also,  each  memory  module  consists  of  a  number  of  0(logn)-bit 
registers  each  of  which  is  considered  to  be  one  memory  location  within  the  memory  module. 
The  distinction  between  memory  locations  and  memory  modules  should  be  noted.  Each  PE 
can  read  or  write  one  unit  (logn  bits)  of  data  in  a  memory  location  in  its  row  or  in  its 
column  in  0(L)  time.  This  organization  has  the  advantage  that  the  processing  area  and  the 
storage  area  are  identified  as  two  distinct  components  of  the  design.  Further  architectural 
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and  operational  details  can  he  found  in  [5]. 


2.2  The  Mesh-Connected  Modules  (MCM)  Organization 


A  computa  tion  on  the  RMOT  can  be  done  by  decomposing  it  into  alternating  row-phases 
and  column-phases,  where  in  a  row-phase  (column- phase)  the  entire  computation  is  per¬ 
formed  on  data  within  the  same  row  (column)  of  memory.  In  nontrivial  applications,  the 
time  to  perform  a  computation  is  dominated  by  the  time  taken  by  each  PE  to  perform  the 
computation  on  its  memory  row  and  memory  column.  The  proposed  architecture  enhances 
the  time  performance  of  the  ItMOT  by  augmenting  it  with  more  PEs.  Each  PE  with  its 
row  or  column  bus  in  the  RMOT  is  replaced  by  a  linear  array  of  PEs,  and  the  memory 
modules  within  each  row  or  column  are  partitioned  equally  among  these  PEs,  as  shown  in 
figure  2.  Note  that  now  we  have  two  types  of  communication  links,  ihterprocessor  links  and 
memory  buses.  To  save  on  the  number  of  processors,  we  let  each  row  and  column  array  of 
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Figure  2:  Augmenting  an  RMOT  row  with  more  PEs 

PEs  share  one  PE.  The  resulting  organization  is  shown  in  figure  3  (the  distinction  between 
interprocessor  links  and  memory  buses  is  not  shown  in  this  figure).  It  is  interesting  to 
note  that  the  proposed  organization  consists  of  smaller  RMOT  arrays  (modules)  which  arc 
connected  as  a  mesh.  A  formal  definition  of  the  architecture  and  some  additional  notation 
arc  given  below. 

Definitions  and  Notation 

The  Mesh-Connected  Modules  (MCM)  organization  consists  of  p  PEs  and  0(n2)  memory 
locations.  The  PEs  operate  in  a  synchronous  mode  under  the  supervision  of  one  control 
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unit.  This  mode  of  operation  is  usually  known  as  SIMD  (Single- Instruction  Multiple  Data). 
The  MCM  is  organized  as  a  Jfc  X  it  mesh-connected  array  of  basic  mddulqs  (BMs),  where 
l  <  k  <  n.  Each  BM  is  an  RMOT  organization  with  q  PEs  (PEo,--  •  ,PEq~i)  where 
l  <  q  <  n/k;  and  a  q  X  q  array  of  memory  modules  each  of  size  0((n/ kq)2).  In  addition, 
each  PE;  of  a  BM  is  connected  to  PE;  in  each  adjacent  BM.  The  memory  module  in  the 
ith  row  and  jth  column  of  the  array  is  denoted  by  MMij.  Also,  a  memory  row  is  dcr.c  ted 
by  RM,  and  a  memory  column  is  denoted  by  CM.  It  is  assumed  that  O(logn)  bits  can  be 
moved  across  each  bus  or  data  link  in  a  constant  amount  of  time.  The  organization  of  a  BM 
was  shown  in  figure  1,  where  the  dashed  lines  represent  the  interproccssor  links  between 
each  PEs  and  its  four  possible  neighbor  PEs. 


An  MCM  can  be  specified  by  the  three  parameters  n,  It  and  q,  described  above,  where 
L  <  k  <  n  and  L  <  <7  <  n/k.  We  use  the  notation  MCM(n,  £,<7)  to  denote  an  MCM  having 
0(n2)  words  of  memory  and  organized  as  a  k  x  k  array  of  BMs,  each  having  <7  PEs.  The 
total  number  of  PEs  in  the  MCM  is  j>  —  k2q ,  and  the  total  number  of  memory  modules  is 
k2q2  (i.c.  a  total  of  0(n2)  memory  locations).  Let  m  =  n/k ,  then  each  memory  module 
has  memory  locations,  and  the  memory  array  of  each  BM  has  a  total  of  0(m2) 

memory  locations.  Several  problems  can  lie  solved  by  recursively  partitioning  the  MCM 
into  submeshes  of  BMs,  solving  the  subproblem  in  each  submesh,  then  merging  the  results. 
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A  j-parlilion  of  the  MCM,  L  <  j  <  k,  is  a  partition  of  the  MCM  into  k2/j2  suhmeshes,  eacli 
consisting  of  an  array  of  j  X  j  BMs.  Eacli  such  suhmesh  is  called  a  j-mesh.  and  denoted  by 
Af-'(t),  where  i  is  the  index  of  the  suhmesh  in  the  .partition.  Note  that  a  j-mesh  is  also  an 
where  m  =  n/k.  Also,  by  definition,  a  BM  s  a  1  -mesh. 


3  Main  Results 


It  was  mentioned  earlier  that  an  MCM(n,  i k,  9)  can  solve  several  image  and  vision  problems 

* 

in  0(n2/p)  time  using  p  =  k2q  PEs,  1  <  p  <  n3/2.  For  a  large  number’ of  vision  and  image 
problems  it  can  be  shown  that  the  MCM  is  superior  to  the  standard  mesh  computer  [16] 
and  the  pyramid  computer  [19].  Table  1  compares  the  performance  of  an  MCM(n,  -y/n,  y/n) 
having  0(n3^2)  PEs  with  that  of  the  pyramid  computer  and  the  standard  mesh-connected 
computer  each  having  0(n2)  PEs.  Definitions  and  algorithmic  details  for  these  problems 
are  given  in  section  4. 
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Table  1:  Performance  comparison  of  an  MCM  with  pyramid  and  mesh  computers 


As  an  enhancement  to  the  standard  mesh,  Miller  and  Stout  [18]  considered  a  variable- 
diameter  mesh  which  consists  of  a  yjp  x  y/p  array  of  PEs,  each  having  a  local  memory  of 
size  n2/p  locations.  They  have  shown  that  the  problems  listed  in  table  1  can  be  solved  in 
0(n2/p)  time  on  the  variable-diameter  mesh,  where  l  <  p  <  n4/3.  The  MCM  is  superior 
to  this  mesh  since  it  provides  0(n2/p)  time  solutions  for  l  <  p  <  n3/2.  In  fact,  it  is  easy 
to  sec  that  the  variable-diameter  mesh  is  a  special  instant  of  the  MCM(n,  k,q)  obtained 
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by  choosing  <j  =  l  (thus  />  =  Jfc2,  and  1  <  k  <  n2/3).  Thus,  the  MCM  algorithms  can  he 
implemented  directly  ott  the  variable- diameter  mesh. 


4  Parallel  Vision  Algorithms  on  the  MCM 


This  section  presents  optimal  solutions  to  several  image  problems  on  the  MCM.  Our  ap¬ 
proach  is  based  on  combining  the  divide-and-conqucr  paradigm  with  the  optimal  parallel 
techniques  that  can  be  implemented  on  the  basic  modules.  The  MCM  techniques  basically 
consist  of  performing  parallel  merge  on  the  results  from  th,e  basic  modules.  Optimality 
is  achieved  by  carefully  balancing  the  BM  techniques  and  the,mergc  operations  as  will  be 
shown  later.  Processor-time  optimal  parallel  algorithms  for  solving  several  image  and  vision 
problems  on  a  BM  have  been  presented  in  [1,  5].  Table  2  summarizes  these  results.  These 
algorithms  form  the  basic  subtechniques  for  the  MCM  algorithms. 

Radix-Sorting  of  m2  integers 

Parallel  Prefix  Computations  on  m2  elements 

Problems  on  an  m  X  m  Image: 

1.  Histogram  computations 

2.  Histogram  modification  (equalization) 

3.  Labeling  connected  components 

4  P.onvpy  bull  of  each  figure 

5.  Diameter  of  each  figure 

6.  Smallest  enclosing  rectangle  of  each  figure 

7.  All-point  closest  neighbors 

8.  Closest  figures 

Table  2:  Problems  that  can  be  solved  in  0(m2/q)  time  on  a  BM  with  q  PEs 
(Alnuweiri  and  Prasanna  Kumar  [1,  5]) 


Before  we  proceed  further,  it  is  necessary  to  introduce  some  additional  notation  which 
will  clarify  the  description  of  our  algorithms.  The  image  is  initially  partitioned  into  k 2 
disjoint  squares  each  of  size  m  x  m  pixels,  where  m  =  n/k.  Each  such  square  is  called  a 
basic-square  and  stored  in  a  distinct  BM.  Given  an  n  x  n  image  F,  a  j-partition  of  the  F 
is  defined  to  be  the  partitioning  of  F  into  disjoint  squares  each  consisting  of  j  X  j  basic- 
squares  of  the  image.  Each  such  square  is  called  a  j-square  and  denoted  by  Q3{i),  where 
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i  —  0, 1 ,  •  •  •  ,(fc/j)2  —l  is  the  index  assigned  to  each  square.  Note  that  the  size  of  a  j -square 
is  jin  x  jin  pixels.  The  squares  of  a  j -partition  of  the  image  are  indexed  such  that  QJ(i) 
is  stored  in  MJ(i).  For  j  <  2,  each  Q*(j)  can  he.  partitioned  into  four  equal  quadrants, 
denoted  by  QJ0,  Q\ ,  Q32,  Q }3.  Also,  note  that  Q°(i)  denotes  the  basic-square  in  DM{i).  The 
boundary  of  a  j -square  consists  of  topmost  and  bottom-most  rows  of  pixels  and  the  leftmost 
and  rightmost  columns  of  pixels  within  the  j -square.  The  boundary  of  an  image  square  Q 
is  denoted  by  £((?).  The  pixels  on  the  boundary  of  a  j -square  arc  stored  in  the  PEs 
of  the  boundary  BMs  of  MJ(t). 

A 

4.1  Histogramming  on  the  MCM  1 

t 

Histogramming  is  a  widely  used  operation  in  the  early  processing  of  images,  and  in  im¬ 
age  enhancement  applications.  The  histogramming  problem  can  be  defined  as  follows  [26]: 
Given  an  n  x  n  image  I  having  H  gray-level  values  0, 1,  •  •  • ,  H  -  l,  compute  the  number  of 
pixels  having  the  gray-level  value  g  for  each  g  =  0, 1,  •  •  • ,  H  -  1.  Each  gray- level  value  can  be 
represented  by  a  binary  number  gh-i,gh-2,  •  •  •  ,ffo>  where  h  -  flogi/].  ,£he  histogramming 
problem  for  an  m  x  m  image  can  be  solved  in  0[m2/q)  time  on  a  BM  with  q  PEs  by  using 
the  parallel  radix-sort  technique  described  in  [5]  (sec  table  1).  This  result  is  processor- 
time  optimal  since,  sequentially,  this  problem  requires  fl(m2)  time.  Histogramming  can  be 
performed  efficiently  on  the  MUM  by  using  data  reduction  and  divme-ana-conquer  tech¬ 
niques.  Our  technique  consists  of  two  stages:  a  sorting  stage,  and  a  key-counting  stage. 
Suppose  that  the  given  n  X  n  image  consists  of  H  different  gray-levels,  the  two  stages  can 
be  performed  as  follows: 


•  Sorting  Stage:  Partition  the  MCM  into  j-meshes  each  having  a  total  of  at  least 
H  memory  locations,  i.e.  j2m2  >  H  (for  simplicity,  assume  that  II  =  i2m2).  Each 
memory  location  in  such  a  j-mesh  corresponds  to  a  distinct  key  (gray-level  value). 
Within  each  j-mesh ,  sort  the  pixels  by  their  gray-level  values,  then  compute  the  sum 
of  the  pixels  in  each  gray-level,  and  finally,  store  the  sum  into  the  memory  location 
corresponding  to  that  gray-level  key.  It  is  clear  the  time  complexity  of  this  stage  is 
dominated  by  sorting  the  pixels  within  each  j-mesh ,  which  is  0{jv~). 

•  Key-Counting  Stage:  This  stage  is  performed  using  a  dividc-and-conqucr  tech¬ 
nique.  Since  the  initial  size  of  each  suhmesh  is  j  xj  BMs,  this  stage  takes  logfc-  log;  = 
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log (k/j)  iterations.  In  iteration  i,  the  size  of  each  suhmesh  is  (j 2*  X  j‘2')  DMs.  Be¬ 
cause  the  number  of  different  keys  is  fixed  and  equal  to  II ,  After  each  iteration  the 
keys  are  divided  equally  among  the  RMs  of  .each  t-mesh,  t  —  j'l1.  Furthermore,  the 
counts  (sum  of  elements)  of  the  same  key  in  different  t -meshes,  arc  stored  in  the  same 
memory  location  within  each  t-mesh.  Step  i  of  this  stage  is  performed  as  follows: 

1.  In  this  iteration,  the  sum  of  pixels  for  each  key  in  a  2l-mesh,  t  =  j 2',  is  to  be 

computed  by  adding  the  four  sums  for  that  key  in  the  four  quadrants  of  the 
2 t-mesh.  Since  these  four  sums  are  stored  in  the  same  memory  location  within 
each  quadrant,  this  operation  can  be  performed  in  0(j2*  +  f ^  ] )  time  by 

rotating  the  data  within  each  row  and  column  of  I^Es  in  a  2i-mesh. 

2.  Now  we  partition  the  keys  among  the  rows  of  each  2 t-mesh,  so  that  each  RM 

will  contain  a  smaller  number  of  keys.  This  can  be  done  in  0(j 2*  +  *  -^1) 

time  by  a  simple  rotation  of  data  in  each  row  and  column  of  memory  within  each 
2t-mcsh. 


The  total  time  taken  by  the  log(Jt/j)  iterations  is 


i :  . . 


E 

»=o 


E  *  +  r=-£i 


which  is  bounded  by  0(k  +  ^  •  y/H).  For  processor-time  optimality,  we  must  have 


771  in  ^ 

-Vli  <  —  =>  H  <  m2, 

9  9 

i.e.,  the  memory  size  of  each  BM  must  be  equal  or  greater  than  the  maximum  number  of 
keys  (gray-levels).  Thus,  we  have  the  following  theorem: 


Theorem  1  Given  an  image  with  II  or  less  gray-levels,  computing  the  histogram  of  the 
image  can  be  done  in  0(k  +  —  •  V7l)  lime  on  an  MCM(n,  k,q)  with  p  =  k2q  processors. 
If  II  =  0(m2),  where  m  =  n/k,  then  the  histogram  can  be  computed  on  this  MCM  in 
0(k  •!-“■)=  0(k  +  ^)  lime,  which  is  optimal. 


Histograms  are  used  in  a  large  number  of  applications  such  as  image  enhancement.  For 
example,  in  a  certain  image  some  gray-levels  may  be  heavily  populated  while  other  levels 
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may  have  a  much  lower  number  of  pixels.  To  enhance  the  dynamic  range  of  the  image,  a 
redistribution  of  gray-level  values  may  he  desired.  This  can  be  done  by  a  technique  called 
histogram  modification  [26],  in  which  given  a  histogram  IIm  for  an  image  F,  it  is  required 
to  transform  the  gray-scale  of  the  image  so  as  to  give  the  image  a  specified  histogram  . 
The  new  histogram  has  the  same  number  of  gray-levels  as  the  old  histogram  but  has  a 
different  number  of  pixels  in  each  level.  It  is  easy  to  show  that  image  enhancement  by  this 
method  can  be  performed  on  an  MCM(u,  fc,  q)  in  0(Jfe+  )  time,  where  H  is  the  number 
of  the  gray  levels.  Again,  if  we  choose  l{  —  m2,  then  this  time  becomes  0(k  +  which  is 
optimal  for  1  <  p  <  n3/2  and  1  <  fc  <  n2/3. 

« 

* 

t 

4.2  Line  Detection  by  the  Hough  Transform 

The  Hough  Transform  is  a  general  technique  for  feature  detection  in  images.  The  method 
involves  accumulating  evidence  for  parametrized  objects  from  the  image  feature  space  [9]. 
The  principle  of  line  detection  using  the  Hough  Transform  is  based  oii  transforming  each 
pixel  (t,  j)  in  the  image  (also  called  the  feature  space)  into  a  sinusoidal;curve  in  the  param¬ 
eter  space.  This  transformation  is  defined  by  the  normal  representation  of  a  line  (passing 
through  (i, ;)),  which  is  given  by 

o  =  tsin0  4-  i  cos0.  (1) 

The  computation  of  the  Hough  Transform  proceeds  by-  associating  an  accumulator  cell 
A(s,  t)  with  each  point  ( p,,0t )  (in  the  parameter  space)  which  keeps  a  count  of  the  number 
of  curves  intersecting  at  that  point.  For  computational  purposes  the  ranges  of  p  and  0  are 
quantized  into  a  number  of  levels.  If  the  number  of  quantized  levels  of  0  (respectively,  p)  is 
L  (respectively,  M),  then  we  have  ML  accumulator  cells  each  associated  with  a  pair  (/?,,<?*) 
of  quantized  values,  where  l  <  s  <  M  and  1  <  t  <  L. 

Suppose  that  Oi,  •••,#£,  arc  the  L  given  samples  of  0,  then  the  Hough  Transform  can 
be  computed  using  L  histogramming  steps.  In  Step  i,  each  image  pixel  (i,j)  increments 
the  count  of  A(p,,0t)  by  one,  where  p,  —  isintf*  +  jcosOi.  Implementing  this  technique 
directly  on  the  MCM(n,fc,qr)  leads  to  0(Lk  +  Ln2/p)  time.  The  term  Ln2/p  in  the  time 
complexity  is  within  the  optimal  bound;  however,  the  term  Lk  is  not  for  sufficiently  large 
values  of  k  and  L ,  e.g.  L  =  0(n),k  =  0(n '/2).  The  method  can  be  improved  by  modifying 
the  key-counting  stage  of  our  histogramming  technique.  The  complete  algorithm  is  rather 
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lengthy  to  dcscril»e  here.  We  will  consider  a  special  ease  which  shows  how  optimality  can 
he  achieved  on  the  MCM. 

Consider  computing  the  Hough  Transform  of  an  n  X  n  image.  Let  L  he  the  number  of 
values  of  0  and  M  he  the  number  of  values  of  p,  and  let  //  =  LM  he  the  total  number  of 
accumulator  cells  needed  to  compute  the  transform.  Now  let  M  =  m2  and  L  =  Jk2,  where 
as  before  m  =  n/k.  Bach  BM{,  1  <  i  <  fc2,  has  m2  memory  locations  which  can  store 
the  AT  =  m2  values  of  p  for  a  particular  0;.  In  the  computation  each  BM  computes  the 
Hough  Transform  for  its  image  pixels  for  a  certain  0.  This  can  be  done  in  0(m2/g)  time 
by  histogramming.  The  m2  values  of  p  are  then  shifted  to  an  adjacent  "BM,  so  that  values 

I.  » 

of  p  for  a  new  value  of  0  can  be  computed.  Using  this  technique  each  BM  computes  the 

» 

values  of  p  for  the  L  —  k2  values  of  0.  Since  each  such  computation  takes  0(m2/q)  time, 
the  total  time  taken  by  each  BM  is  0(k2m2 /q).  The  data  transfer  time  is  also  0(k2m2 /q). 
Since  H  =  LM  —  k2m2,  the  computation  time  on  an  MCM(n,  Jb,gj  is  0(H/q)  =  0(L~) 
using  p  =  k2q  PEs,  which  is  optimal  since  a  sequential  computation  of  the  ItT  would  take 
0(Ln2)  time. 

( . 

** 

Theorem  2  Computing  the  Hough  Transform  can  be  done  in  0(k  +  L^)  time  on  an 
MCM(n,  k ,  g)  with  p  =  k2q  processors,  where  L  is  the  number  of  discrete  values  of  0  in 

4U~  - „„„„„ 
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4.3  Computing  Image  Connected  Components 

A  black  and  white  (binary)  image  may  consist  of  several  connected  regions  of  black  pixels. 
A  connected  component  (or  a  figure)  is  a  maximally  connected  region  of  black  pixels.  The 
image  labeling  problem  is  to  identify  a  unique  label  with  each  figure  in  the  image.  The 
labeling  problem  can  be  solved  by  using  a  bottom-up  recursive  procedure  in  which  the 
figures  in  a  k  x  k  square  are  labeled  first  by  labeling  the  connected  regions  within  each 
|  X  |  quadrant  of  the  square,  then  by  using  the  adjacency  information  along  the  common 
boundaries  of  the  quadrants  to  label  the  connected  regions  within  the  k  X  k  square.  Nassimi 
and  Sahni  have  proposed  this  approach  for  meshes  (20),  and  since  then  it  was  implemented 
on  several  other  parallel  organizations  such  as  the  pyramid  computer  (19),  MOT  [23],  and 
hypercubc  computers  (8).  The  same  technique  can  be  implemented  on  the  MCM  as  follows: 
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Initially,  the  figures  within  each  basic-square  can  be  labeled  in  0(^-)  time  by  using  a 
module-algorithm.  Next,  a  recursive  merging  algorithm  is  used  to  compute  the  labels  of 
the  figures.  The  algorithm  consists  of  logit  iterations.  In  iteration  i,  1  <  t  <  logfc  -  l,  the 
figures  within  each  j -square,  j  =  2‘,  arc  labeled  by  considering  the  pixels  on  the  boundaries 
of  its  four  quadrants.  The  merge  is  done  first  by  constructing  an  auxiliary  graph  GJ(/i) 
representing  the  connectivity  among  the  figures  in  the  four  quadrants  of  each  QJ(/i),  then 
computing  the  connected  components  of  each  such  graph.  Each  vertex  in  G*(h)  represents 
a  figure  which  intersects  the  boundary  of  a  quadrant.  The  edges  of  G*{h)  arc  constructed 
as  follows:  (vx,vy)eE  if  two  figures  labeled  x  and  y  in  two  adjacent  quadrants  have  at  least 
two  pixels  which  arc  4-ncighbors  along  the  coimnon  boundary  of  the  two  quadrants.  It  is 
clear  that  G-7  has  O(jm)  vertices  and  0(jm)  edges. 


Lemma  1  Given  the  graph  G-7  constructed  as  above,  and  stored  in  a  y-mesh,  the  connected 
components  of  this  graph  can  be  computed  in  0(j  +  sf]  •  ~  ■  log(ym))  time. 

Proof:  The  O(jm)  edges  of  the  given  graph  G-7  are  stored  in  a  column  (or  row)  of  boundary 

BMs.  Let  these  edges  be  stored  in  the  leftmost  column  of  BMs  in  the  j-mesh.  The  connec- 

• 

tivity  algorithm  we  use  is  basically  an  implementation  of  the  parallel  connectivity  algorithm 
in  [28].  For  a  graph  with  jm  vertices,  this  algorithm  requires  0(log(jm))  iterations  on  a 
OP.CW  PRAM  with  U{jrn)  PEs.  In  each  step,  each  PE  performs  a  constant  time  operation 
on  its  local  data.  To  implement  this  algorithm  on  the  MCM,  each  read/write  step  on  the 
PRAM  is  simulated  by  a  random  access  read/write  operation  [21]  on  the  MCM.  Once  each 
PE  has  the  appropriate  data  in  its  RM,  it  performs  the  desired  operations  on  this  data. 

To  implement  this  algorithm  on  a  j-mesh  in  which  G 3  is  stored,  we  first  move  the  O(jm) 
edges  of  the  graph  into  a  l^'l^-mesh  within  the  j-mesh ,  where  as  before  j  —  2*.  This  can 
be  done  as  follows:  First  partition  the  2*  BMs  into  2 groups,  each  consisting  of  2l-’/2J 
BMs.  Move  the  data  within  each  such  group  into  a  distinct  column  of  BMs,  such  that  the 
2fl/,zl  groups  are  stored  in  2l‘/zJ  adjacent  columns  of  BMs.  This  can  be  in  0(2^/^  +m/q) 
time  by  moving  the  data  within  each  row  of  PEs.  The  data  within  each  such  column  of 
BMs  is  moved  upward  (within  the  j-mesh)  so  that  all  the  edges  arc  now  in  a  2L*/2J  x  l^!2^ 
array  of  BMs.  If  t  is  even  then  this  array  constitutes  a  y/j-mesh.  If  t  is  odd  then  this  array 
constitutes  the  upper  half  of  a  2^' I2]  -mesh.  Each  PE  within  the  smaller  mesh  has  0(m/q) 
edges  only  in  its  RM.  This  data  movement  is  illustrated  in  figure  4,  and  it  can  be  completed 
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in  0(2*  H-  2l'^2J  tn/q)  time. 
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Final  poaltlon  of  data 


Figure  4:  Concentrating  data  into  a  smaller  submesh 

\. 


Within  the  2^ '^-mesh,  each  random  access  read/write  operation  t&n’-be  performed  in 
0(2W2lm/q)  =  0(y/J  •( m/q ))  time  (2|.  Thus,  the  log(jm)  iterations  of  the  connectivity 
algorithm  in  [28]  can  be  performed  in  0(\/j  •  {m/q)  •  (log  jm))  time.  □ 

After  computing  the  connected  components  of  G the  correct  iabeis  of  the  pixels  on 
the  boundary  of  each  j -square  (and  also  on  the  boundary  of  its  four  quadrants)  can  be 
computed.  The  procedure  is  then  repeated  for  each  ( 2j)-square ,  (4 j)-square,  and  so  on 
until  the  entire  image  has  been  processed.  The  time  for  iteration  i  of  the  algorithm  is 
0(j  +  >/J  •  ~  •  log(jm)),  where  j  =  2',  and  the  time  for  the  initial  module  computation  is 
0{m2/q).  Thus,  the  time  for  all  logfc  iterations  is, 

loBfe  /  .  .  .  \ 

m2/q+  (2‘  +  2,/2(m/g)(Iog2'm)J 

i=i 

which  is  bounded  by  0(k  +  m2/q)  =  0(k  n2/p),  where  p  =  k2q. 


Since  only  the  labels  of  the  pixels  on  the  boundary  of  each  quadrant  are  updated  in 
each  iteration,  some  pixels  may  still  have  old  incorrect  labels  at  the  end  of  the  computa¬ 
tion.  These  pixels  can  be  labeled  with  the  correct  labels  by  applying  a  top-down  recursive 
procedure  similar  to  the  above  procedure  but  in  the  reverse  direction.  That  is,  use  the 
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labels  of  the  pixels  on  the  boundary  of  each  j -square. to  label  the  pixels  on  the  boundaries 
of  its  four  quadrants.  This  is  done  starting  from  the  square  Qk  which  consists  of  the  entire 
n  X  n  image,  and  then  applying  the  technique  until  the  size  of  each  quadrant  is  m  x  m 
pixels.  Each  such  quadrant  is  local  to  one  UM  and  its  pixels  can  be  labeled  with  the  correct 
labels  in  0(m2/p)  time  using  a  module-algorithm.  The  following  theorem  summonses  this 
result. 


Theorem  3  Given  an  n  x  n  black  and  white  image,  all  connected  regions  of  black  pixels 
in  the  image  can  be  labeled  in  0(k  +  n2/p)  time  on  an  MCM(n,k,q)  jvith  p  PEs,  where 
1  <  p  <  n3/2,  and  k  <  n2/3.  ’•  ; 


Besides  being  optimal,  our  result  is  superior  to  previous  known  results.  For  example, 

\ 

using  n3/2  the  MCM  can  provide  0{nx^2)  time  solution  to  the  above  problem.  A  pyramid 
computer  can  provide  the  same  solution  time  but  using  n2  PEs  [19].  A  2MCC  with  n2 
PEs  provides  0(n)  time  solution  to  this  problem  [20].  The  hypercube  and  shuffle-exchange 
networks  can  provide  0(log2  n)  time  solutions  using  n2  PEs  [8],  but  they,  require  a  much 
larger  area  if  implemented  in  VLSI. 

4  *  /"i - „ 

-•  -  -  -  O'  -  ~  “ 

We  now  consider  the  problem  of  computing  the  convex  hull  of  each  figure  (connected  com¬ 
ponent)  in  an  n  x  «  image  in  which  all  figures  have  been  labeled.  A  set  5  of  pixels  is  said 
to  be  convex  if  the  smallest  convex  polygon  containing  them  contains  no  other  pixels.  Such 
polygon  is  called  the  convex  hull  of  5,  and  it  is  denoted  by  CH(5).  The  MCM  algorithm 
is  base  "  ic  dividc-and-conquer  paradigm  for  merging  disjoint  convex  hulls  [25].  Two 
convo  .iuus  are  said  to  be  disjoint  if  there  exists  a  vertical  line  in  the  plane  such  that  all  the 
vertices  of  one  hull  lie  on  or  to  the  left  of  this  line,  and  all  the  vertices  of  the  second  hull  lie 
to  the  right  of  this  line.  By  merging  two  convex  hulls  Ai  and  A 2,  we  mean  computing  the 
convex  hull  CH{A\,At)  of  At  and  A2.  Merging  two  disjoint  convex  polygons  A 1  and  A 2 
can  be  done  by  computing  the  two  tangents  common  to  At  and  A 2,  and  by  eliminating  the 
vertices  which  become  internal  to  the  resulting  polygon.  To  compute  these  tangent  lines, 
it  is  useful  to  partition  At  (A2)  into  an  upper  hull  and  a  lower  hull.  Let  l  and  r  be  the  two 
vertices  with,  respectively,  the  smallest  and  largest  z-coordinates  among  all  the  vertices  of 
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Figure  5:  Partitioning  convex  hull  ' 

Ai  (^2)-  The  upper  hull  consists  of  all  the  vertices  of  Ai  (A2)  which,  lie  on  or  above  the 
line  Tv,  and  the  lower  hull  consists  of  all  the  vertices  below  this  line. 

i :  . 

t 

Our  merging  technique  is  based  on  the  following  procedure:  Suppose  we  have  two  convex 
hulls  (polygons),  CHI  and  CH2.  We  consider  the  merge  procedure  for  the  upper  two 
hulls  of  CHI  and  CH2  (denoted  UH1  and  UII2,  respectively).  The  same  procedure  can 
be  applied  to  merge  the  lower  hulls.  Let  ulil=  (wi,U2r* •  )V4)  and  UH2=  {ui,U2, •  •  •  >u«)> 
where  the  u;’s  (u,’s)  are  the  vertices  UHl  (UII2)  given  in  the  increasing  order  of  their 
^-coordinate  (see  figure  5).  Mark  j  vertices  V  =  (v;, ,  v;,,  •••,u;J)  from  UHl  and  j  vertices 
U  =  (uf, •••lu,</)  from  UH2,  where  u,-,  =  vi,v,)  =  va,ui{  =  u1}  and  u,-,  =  ua.  The 
marked  vertices  are  also  ordered  by  their  x— coordinate.  The  sets  V  and  U  partition  UHl 
and  UH2  into  j  -  1  segments,  where  a  segment  [u;r,  u;r+i )  ((u;r, «ir+l])  consists  of  all  the 
vertices  of  UHl  (UII2)  in  between,  and  including,  V{r  and  u,r+1  (tqr  and  u,r+I).  Note  that 
the  segments  do  not  necessarily  contain  the  same  number  of  vertices.  Finally,  we  assume 
each  vertex  in  V  ( U )  is  stored  in  a  record  together  with  the  two  edges  from  UHl  (UII2) 
incident  upon  it.  The  two  hulls  UHl  and  UI12  can  be  merged  in  two  stages  as  follows: 

Procedure  MERGE- HULL 


•  Stage  1:  This  stage  performs  the  binary  search  procedure  only  on  the  vertices  in  V 
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and  U  and  the  edges  incident  upon  them.  Each  iteration  of  the  algorithm  consists 
of  choosing  two  vertices  v,\  and  u;t,  constructing  the  line  joining  them,  ami  then 
testing  if  this  line  is  tangent  to  both  UHl  aud  Ult2  (25).  Thus,  this  stage  terminates 
in  at  most  (logj)  iterations.  Upon  termination,  cither  a  common  tangent  is  found 
(and  we  are  done)  or  two  segments  of  vertices,  one  from  UHl  and  one  from  UH2,  are 
identified.  These  two  segments  contain  the  two  vertices  through  which  the  common 
tangent  passes. 

•  Stage  2:  This  stage  computes  the  common  tangent  by  performing  the  binary  search 

procedure  on  vertices  of  the  two  segments  identified  in  the  previous  stage.  This  stage 

* 

terminates  in  at  most  (log  t)  iteration,  where  t  is  the  tbtal  number  of  vertices  in  the 
two  segments.  ’ 


To  merge  pairs  of  convex  hulls  in  two  adjacent  image  squares,'  the  above  two  stages 
must  be  applied  simultaneously  to  all  pairs  of  convex  hulls  which  touch  along  the  conunon 
boundary  of  the  two  squares.  The  following  steps  show  how  the  aboVe  procedure  can  be 
implemented  on  the  MCM:  ■  - 

*  v 

1.  Initial  module  computation:  We  start  by  computing  the  convex  hulls  for  all  the 
figures  within  each  basic  square.  Since  each  basic  square  is  local  to  one  BM,  this 
computation  can  be  completed  in  0{m-  jq)  time  using  a  BM  algorithm.  Note  that  if 
a  figure  extends  over  several  basic  squares,  then  this  figure  will  have  a  convex  hull  in 
each  one  of  these  basic  squares.  All  the  convex  hulls  of  one  figure  must  be  merged  in 
order  to  obtain  the  final  convex  hull  of  the  figure.  For  the  merge  phase,  we  need  only 
to  consider  convex  hulls  of  figures  which  have  at  least  one  pixel  on  the  boundary  of  a 
basic  square. 

2.  Recursive- merge:  This  phase  consists  of  a  recursive  procedure  which  runs  in  logfc 
iterations.  In  iteration  i,  l  <  t  <  logfc,  we  compute  the  convex  hulls  of  the  figures 
within  each  j -square,  Q\  where  j  =  2’.  At  the  beginning  of  this  iteration,  the 
convex  hulls  of  the  figures  within  each  quadrant  of  Q}  have  been  computed.  So  we 
need  to  merge  each  group  of  (at  most  four)  convex  hulls  of  the  same  figure  in  the 
four  quadrants.  It  can  be  shown  that  all  pairs  of  hulls  can  be  merged  in  |  log(;m) 
iterations  by  using  a  binary-search  procedure  (25).  Our  merge  procedure  is  based  on 
implementing  this  binary  search  technique.  The  details  of  the  implementation  arc 
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rather  tedious  and  are  left  out  to  the  full  version  of  this  paper.  The  following  theorem 
summerizes  our  result. 

Theorem  4  Given  an  n  x  n  image  in  which  all  figures  have  been  labeled,  the  convex  hull 
of  each  figure  can  be  computed,  for  all  figures  simultaneously,  in  0(n2 /p)  time  on  an 
MCM(n,  k,  q)  with  p  PEs,  where  1  <  p  <  n3/2  and  k  <  n2^3. 

Using  a  similar  computation,  several  other  geometric  properties  can  he  computed  for 

4 

each  figure  in  the  image.  For  example,  computing  the  diameter  and  -a  smallest  enclosing 
rectangle  for  each  figure  can  be  computed  in  0(n2/p)  time  on  an  MCM (n,k,q)  with  p  PEs, 
where  1  <  p  <  n3/2  and  k  <  n2/3.  Again  for  p  =  n3/2,  the  above  problems  can  be  solved  in 
(^(n1/2)  time,  which  is  superior  to  the  0(n)  time  solution  on  a  standard  n  x  n  mesh  of  PEs 
[16],  and  the  0{ny!2  logn)  solution  on  a  pyramid  computer  of  size  n2  [19]. 

\ 

4.5  Computing  All  Closest  Neighbors 

i .  , . 

*» 

We  consider  two  problems  related  to  computing  distances  among  image  regions.  The  first 
problem  we  consider  is  the  closest  neighbor  problem.  In  this  problem  we  are  given  annxn 
macs  ana  white  image,  ami  we  are  required  to  compute  for  each  pixel  p  (eiihei  black  oi 
white)  a  closest  black  pixel.  The  computed  black  pixel  will  be  called  a  closest  neighbor  of  p. 
As  a  measure  of  distance,  we  will  use  the  L\  metric.  In  this  metric  the  distance  between  two 
pixels  (i,  j)  and  (u,  v)  (denoted  by  DIST[(i  j),(u,v)]  is  given  by  |i  -  u|  + 1  j  — 1>|.  A  bottom-up 
recursive  approach  can  be  used  for  this  problem.  To  compute  the  closest  neighbor  for  each 
pixel  in  a  j -square  square  Q3 ,  we  first  solve  the  problem  within  each  quadrant  of  Q3  and 
then  merge  the  results  from  the  quadrants.  It  can  be  shown  that  in  the  merge  step  only 
the  boundary  pixels  of  the  quadrants  need  to  be  considered. 

The  divide-and-conqucr  procedure  proceeds  as  follows^  Initially,  the  closest  neighbors 
arc  computed  within  each  basic  square.  Since  each  basic  square  is  local  to  one  BM,  this 
can  be  done  in  0(m2/q)  time  using  a  module-algorithm.  A  bottom-up  recursive  algorithm 
is  then  used.  In  the  beginning  of  iteration  t  of  this  algorithm,  l  <  i  <  log k,  the  boundary 
pixels  of  each  quadrant  of  a  j-square ,  j  =  2',  have  computed  their  closest  neighbors  within 
that  quadrant.  Iteration  i  consists  of  computing  the  closest  neighbors  of  the  boundary 
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Figure  G:  Closest  neighbor  computations  4 

I. 

pixels  of  each  j-squarc,  by  merging  the  information  along  the  boundaries  of  its  quadrants. 

Theorem  5  Given  annxn  black  and  white  image,  the  closest  neighbor  of  each  pixel  in  the 
image  can  be  computed  in  0(k  +  n2/p)  time  on  an  MCM(n,k,q)  with  p  PEs,  1  <  p  <  n3/2 
and  k  <  n2^3.  ’ 

r.  . 

Proof:  Let  be  the  j-mesh  containing  the  j -square,  Q\  and  let  M^l2\l),  be  the  (j  / 2)- 
mesh  (within  which  contains  quadrant  QJt,  0  <  l  <  3,  of  Q 1  (see  figure  6).  The  major 
sten  in  the  inerec  urocedure  is  the  data  transfer  needed  to  enable  each  pixel  on  the  boundary 
of  a  quadrant  to  compute  its  nearest  neighbors  in  the  other  three  quadrants.  Once  a  pixel 
has  this  information,  computing  its  nearest  neighbors  can  be  done  in  0(1)  time.  Consider 
the  pixels  on  the  boundary  of  quadrant  Q\.  Each  PE  in  the  rightmost  column  (topmost 
row)  of  BMs  in  2)  reads  the  closest  neighbors  of  the  boundary  pixels  in  its  neighbor 

PE  in  the  leftmost  column  (bottom-most  row)  of  BMs  in  3)  (M^2^(0)).  Since  each 

such  PE  has  0(rn/q)  boundary  pixels  in  its  RM,  this  operation  can  be  completed  in  0(m/q) 
time.  The  pixel  (u  -  l,v  +  1)  in  the  south-western  corner  of  Q 2  can  be  also  moved  into 
the  PE  containing  pixel  (u,  v)  in  the  north-eastern  corner  of  Q}2,  in  0(1)  time.  Now  the  PE 
containing  (u,  v)  can  compute  the  closest  neighbors  of  ( u,v )  within  quadrants  Qq,  Q\,  and 
Q{.  These  three  closest  neighbors  are  then  broadcast  to  all  boundary  PEs  in  2). 

This  can  be  done  in  0(j  +  log  7  mjq)  Mine.  Now  each  PE  in  the  rightmost  column  and 
topmost  row  of  BMs  in  M^^2\ 2),  has  all  the  informatior  necessary  to  compute  the  correct 
distances  for  each  pixel  in  its  portion  of  tl"i  boundary.  Since  each  boundary  PE  has  r)[u.jq'' 
such  pixels,  this  can  be  done  in  0(m/q)  time.  A  similar  computation  is  perfor  nod  to  update 
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the  closest,  neighbors  of  the  pixels  on  the  boundary  of  Q*.  Thus,  iteration  t  of  the  merge 
algorillim  requires  0(j  I-  log r/  4 ■  m/q)  time,  where  j  •-  2*.  The  time  taken  by  the  entire 
algorithm  is 


—  +  53  (2‘  +  —  +  \ogq)  =  0(k  +  —). 

Where  the  0(m2/q)  term  is  the  time  taken  by  the  initial  module-algorithm.  Since  m  =  n/k 
and  p  =  k2q,  the  time  taken  by  the  algorithm  is  0(k+n2/p)  which  is  optimal  for  1  <  p  <  n3?2 
and  k  <  n2/3.  □  * 

A  closely  related  problem  is  the  problem  of  computing  for  each  figure  F ,  the  label  of  the 
nearest  figure  to  it  in  the  image.  As  before,  define  DIST(p,  q)  to  be  the  distance  between 
two  pixels,  p  and  q  according  to  L\  metric.  Then  the  label  of  the  nearest  figure  to  figure  F, 
is  the  label  of  the  pixel  q  such  that 

DIST(p,9)  =  nun  {DIST(p,9)|peF,<?  jLF) 

P.9 


Corollary  1  Given  an  n  x  n  image  in  which  all  figures  have  been  labeled,  a  nearest  fig¬ 
ure  to  each  figure  can  be  identified,  for  all  figures  simultaneously,  in  0(n2/p)  time  on  an 

%  *  *■  /  •  »  *  ^  ^  3/?  *  »  /  ?/.3 

UJllU  p  fV/bf  WUCIC  l  ^  p  *«•  '  u#m*  ^ 


Proof:  Let  C(p)  denote  the  label  assigned  to  a  pixel  p,  this  problem  can  be  solved  in 
two  major  steps.  In  the  first  step,  we  compute  for  each  pixel  p,  a  closest  neighbor  q 
such  that  C(q)  /-  £(p).  This  can  be  done  by  using  a  simple  modification  of  the  above 
algorithm  for  computing  closest  neighbors.  In  the  second  step,  the  nearest  figure  to  each 
figure  is  computed.  This  can  be  done  by  using  a  slight  variation  of  the  component  labeling 
algorithm.  In  addition  to  keeping  track  of  the  minimum  indexed  pixel  in  the  component, 
we  also  keep  track  of  the  label  of  the  closest  neighbor  with  the  minimum  distance  from  a 
pixel  in  the  component.  From  theorem  3  we  know  that  this  can  be  done  in  0(n2/p)  time. 
At  the  end  of  the  computation,  each  pixel  labeled  F  knows  the  label  of  the  nearest  figure 
to  F.  □ 
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5  Concluding  Remarks 

Wc  have  presented  several  processor-time  optimal  algorithms  for  digitized  images  on  a 
mcsh-conncctcd  processor  array  having  p  PEs  and  0(n2)  memory.  The  organization  uses 
novel  communication  capabilities  and  can  be  implemented  in  current  VLSI  technology. 
The  proposed  organization  has  several  features  which  distinguish  it  from  other  mesh-based 
parallel  computers.  First,  the  PEs  communicate  through  a  shared  memory  as  well  as  direct 
intcrproccssor  links.  Second,  the  MCM  is,  in  fact,  a  hybrid  of  a  mcsh-conncctcd  array 
and  the  communication-efficient  organization  of  the  basic  modules  (BMs).  This  two-level 
organization  of  the  MCM  implies  two  types  of  parallel  algorithms:  basic-module  algorithms 
and  communication-algorithms.  Balancing  the  size  of  each  BM  and  the  number  of  BMs 
in  the  MCM  results  in  processor-time  optimal  algorithms  over  a  wide  range  of  processor 
counts.  The  image  problems  considered  here  require  global  operations  on  image  pixels,  but 
arc  not  computationally  intensive  (they  require  0(n2)  operations  for  an  n  x  n  image).  To 
provide  optimal  solutions  to  such  problems,  several  efficient  data  movement  operations  have 
been  developed  for  the  MCM.  All  of  these  operations  are  efficient  for  applications  in  which 
the  amount  of  data  in  the  memory  array  is  reduced  (i.e.  less  than  0(n2)),  which  is  the  case 
for  several  of  the  image  problems  that  we  have  considered.  The  MCM  organization  has 
been  considered  to  provide  optimal  speedups  for  several  classes  of  problems  such  as  sorting 

-  •  A  -  J**  •*«*•••  ...  I.-  .  .  .1  m.omU  *  f  4 1 
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